SUMMARY. An algorithm is proposed for solving the stereoscopic
matching  problem.  The  algorithm  consists  of  five  steps:  (1)  Each 
image  is  filtered  with  bar  masks  of  four  sizes  that  vary  with 
eccentricity;  the  equivalent  filters  are  about  one  octave  wide.  (2) 
Zero-crossings  of  the  mask  values  are  localized,  and  positions  that 
correspond  to  terminations  are  found;  (3)  For  each  mask  size,  matching 
takes  place  between  pairs  of  zero-crossings  or  terminations  of  the  same 
sign  in  the  two  images,  for  a range  of  disparities  up  to  about  the 
width  of  the  mask's  central  region;  (4)  Wide  masks  can  control 
vergence  movements,  thus  causing  small  masks  to  come  into 
correspondence;  (5)  When  a correspondence  is  achieved,  it  is  written 
into a dynamic buffer, called the 2½-D sketch.

It  is  shown  that  this  proposal  provides  a theoretical  framework 
for  most  existing  psychophysical  and  neurophysiological  data  about 
stereopsis.  Several  critical  experimental  predictions  are  also  made, 
for  instance  about  the  size  of  Panum's  area  under  various  conditions. 
The  results  of  such  experiments  would  tell  us  whether,  for  example, 
cooperativity is necessary for the fusion process.

0 Introduction

In a recent article, Marr & Poggio (1976) analyzed the
computational  structure  of  the  stereo  correspondence  problem  for  stereo 
vision,  and  derived  a cooperative  algorithm  for  extracting  disparity 
information  from  stereo  image  pairs.  Although  the  problem  addressed 
there  was  not  directly  related  to  the  question  of  how  our  brains 
extract  disparity  information,  the  algorithm  they  described,  summarized 
here  in  figure  2,  has  a natural  interpretation  in  terms  of  neural 
structures. 

One  characteristic  of  this  algorithm  is  its  lack  of  dependence  on 
eye-movements,  so  a critical  preliminary  question  for  its  relevance  to 
biology concerns the relative importance of neural fusion and of eye-
movements for stereopsis (Marr & Poggio 1976). Various kinds of
evidence suggest that eye-movements play an important role in stereo-
vision, which implies that a rather different kind of algorithm may be
involved in human stereopsis.

In  this  article,  we  review  the  computational  structure  of  the 
stereo  disparity  problem,  and  briefly  outline  existing  approaches  to 
solving it. We then review the available neurophysiological and
psychophysiological evidence, and point to some of the empirical
questions left unresolved in the literature. Finally, we formulate an
algorithm designed specifically as a theory of the matching process in
human stereopsis, and present a theoretical framework for the overall


Human  stereopsis 


Marr & Poggio


computational  problem  of  stereopsis.  We  show  that  our  theory  accounts 
for  most  of  the  available  evidence,  formulate  the  predictions  to  which 
it  leads,  and  describe  some  critical  experiments.  A computer 
implementation  of  the  algorithm,  and  the  results  of  some  of  these 
experiments, are described by Grimson & Marr (1978), and by Richards &
Marr (1978).

1 Computational structure of the stereo-disparity problem

Because  of  the  way  our  eyes  are  positioned  and  controlled,  our 
brains  usually  receive  similar  images  of  a scene  taken  from  two  nearby 
points at the same horizontal level. If two objects are separated in
depth  from  the  viewer,  the  relative  positions  of  their  images  will 
differ  in  the  two  eyes.  Our  brains  are  capable  of  measuring  this 
disparity  and  of  using  it  to  estimate  depth. 

Three steps (S) are involved in measuring stereo disparity: (S1)
a particular  location  on  a surface  in  the  scene  must  be  selected  from 
one  image;  (S2)  that  same  location  must  be  identified  in  the  other 
image;  and  (S3)  the  disparity  in  the  two  corresponding  image  points 
must  be  measured. 

If  one  could  identify  a location  beyond  doubt  in  the  two  images, 
for example by illuminating it with a spot of light, steps S1 and S2
could  be  avoided  and  the  problem  would  be  easy.  In  practice  one  cannot 
do  this  (figure  1),  and  the  difficult  part  of  the  computation  is 
solving  the  correspondence  problem.  Julesz  found  that  we  are  able  to 
interpret  random-dot  stereograms,  which  are  stereo  pairs  that  consist 
of  random  dots  when  viewed  monocularly  but  fuse  when  viewed 
stereoscopically to yield patterns separated in depth. This might be
thought surprising, because when one tries to set up a correspondence
between two arrays of random dots, false targets arise in profusion
(figure 1). Even so, and in the absence of any monocular or high level
cues, we are able to determine the correct correspondence.
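
The profusion of false targets can be made concrete with a small sketch. It is an illustration only: the positions, the single shared disparity, and the names `candidate_matches`, `left` and `right` are assumptions introduced here, not part of the original analysis. With four indistinguishable elements per eye, every left element can pair with every right element, which reproduces the counts given for figure 1.

```python
from itertools import product

def candidate_matches(left, right):
    """Every possible pairing, as (left position, right position, disparity)."""
    return [(xl, xr, xr - xl) for xl, xr in product(left, right)]

# Hypothetical 1-D positions of four identical elements in the left view.
left = [0, 3, 6, 9]
# Assumption: all four elements lie at the same depth, so the right view
# is the left view shifted by one true disparity.
true_disparity = 1
right = [x + true_disparity for x in left]

matches = candidate_matches(left, right)
correct = [m for m in matches if m[2] == true_disparity]
false_targets = [m for m in matches if m[2] != true_disparity]

print(len(matches), len(correct), len(false_targets))
```

Position alone cannot separate the four correct pairings from the twelve false ones; the constraints introduced next are what make the selection possible.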

In  order  to  formulate  the  correspondence  computation  precisely,  we 
have  to  examine  its  basis  in  the  physical  world.  Two  constraints  (C) 
of importance may be identified (Marr 1974): (C1) a given point on a
physical surface has a unique position in space at any one time; and
(C2)  matter  is  cohesive,  it  is  separated  into  objects,  and  the  surfaces 
of  objects  are  generally  smooth  compared  with  their  distance  from  the 
viewer. 

These  constraints  apply  to  locations  on  a physical  surface. 
Therefore,  when  we  translate  them  into  conditions  on  a computation  we 
must  ensure  that  the  items  to  which  they  apply  in  the  image  are  in  one- 
to-one  correspondence  with  well-defined  locations  on  a physical 
surface.  To  do  this,  one  must  use  image  predicates  that  correspond  to 
surface  markings,  discontinuities  in  the  visible  surfaces,  shadows,  and 
so  forth,  which  in  turn  means  using  predicates  that  correspond  to 
changes  in  intensity.  One  solution  is  to  obtain  a primitive 
description  of  the  intensity  changes  present  in  each  image,  like  the 
primal  sketch  (Marr  1976),  and  then  to  match  these  descriptions.  Line 
and  edge  segments,  blobs,  termination  points,  and  tokens,  obtained  from 
these  by  grouping,  usually  correspond  to  items  that  have  a physical 
existence  on  a surface. 

The  stereo  problem  may  thus  be  reduced  to  that  of  matching  two 
primitive  symbolic  descriptions,  one  from  each  eye.  One  can  think  of 
the  elements  of  these  descriptions  as  carrying  only  position 
information,  like  the  black  dots  in  a random-dot  stereogram,  although 


Figure 1. Ambiguity in the correspondence between the two retinal
projections.  In  this  figure,  each  of  the  four  points  in  one  eye's  view 
could  match  any  of  the  four  projections  in  the  other  eye's  view.  Of 
the  16  possible  matchings  only  four  are  correct  (filled  circles),  while 
the  remaining  12  are  "false  targets"  (open  circles).  It  is  assumed 
here  that  the  targets  (filled  squares)  correspond  to  "matchable" 
descriptive  elements  obtained  from  the  left  and  right  images.  Without 
further  constraints  based  on  global  considerations,  such  ambiguities 
cannot be resolved. Redrawn from Julesz (1971, figure 4.5-1).

for a full image there will exist rules that specify which matches
between descriptive elements are possible and which are not. The two
physical constraints C1 and C2 can now be translated into two rules (R)
for  how  the  left  and  right  descriptions  are  combined: 

(R1) Uniqueness. Each item from each image may be assigned at most
one  disparity  value.  This  condition  relies  on  the  assumption  that  an 
item  corresponds  to  something  that  has  a unique  physical  position. 

(R2)  Continuity.  Disparity  varies  smoothly  almost  everywhere.  This 
condition  is  a consequence  of  the  cohesiveness  of  matter,  and  it  states 
that  only  a small  fraction  of  the  area  of  an  image  is  composed  of 
boundaries  that  are  discontinuous  in  depth. 
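
As a rough illustration of what the two rules demand of a candidate solution, the following sketch checks a list of proposed matches against them. The representation of a match as a (left position, right position) pair, the function names, and the `tolerance` parameter are all assumptions made for this example.

```python
def satisfies_uniqueness(matches):
    """R1: no left item and no right item takes part in more than one match."""
    lefts = [xl for xl, _ in matches]
    rights = [xr for _, xr in matches]
    return len(set(lefts)) == len(lefts) and len(set(rights)) == len(rights)

def discontinuity_fraction(matches, tolerance=1):
    """R2: fraction of neighbouring matches whose disparities differ by more
    than `tolerance` (a hypothetical smoothness threshold)."""
    ordered = sorted(matches)
    disparities = [xr - xl for xl, xr in ordered]
    jumps = sum(abs(b - a) > tolerance
                for a, b in zip(disparities, disparities[1:]))
    return jumps / max(len(disparities) - 1, 1)

# Hypothetical candidate solutions.
smooth = [(0, 1), (3, 4), (6, 7), (9, 10)]   # one match per item, constant disparity
ambiguous = [(0, 1), (0, 4), (3, 4)]          # left item 0 and right item 4 used twice
stepped = [(0, 1), (3, 4), (6, 12)]           # one depth discontinuity

print(satisfies_uniqueness(smooth), discontinuity_fraction(smooth))
print(satisfies_uniqueness(ambiguous))
print(discontinuity_fraction(stepped))
```

A solution that obeys R1 passes the first check, and one that obeys R2 keeps the second value small, since only a small fraction of an image should consist of depth boundaries.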

In practice, R1 cannot be applied simply to grey-level points in
an image, because a grey-level point is in only implicit
correspondence  with  a physical  location.  It  is  in  fact  impossible  to 
ensure  that  a grey-level  point  in  one  image  corresponds  to  exactly  the 
same  physical  position  as  a grey-level  point  in  the  other.  A sharp 
change  in  intensity,  however,  usually  corresponds  to  a surface  marking, 
and  therefore  defines  a single  physical  position  precisely.  The 
positions  of  such  changes  may  be  detected  by  finding  peaks  in  the  first 
derivative  of  intensity,  or  zero-crossings  in  the  second  derivative. 
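
A minimal one-dimensional sketch of the second option follows. It assumes a noise-free intensity profile, uses a plain three-point second difference with no prior smoothing, and ignores exact zeros of the second difference; the function names and the sample profile are invented for the illustration.

```python
def second_difference(signal):
    """Discrete second derivative; entry k is centred on sample k + 1."""
    return [signal[i - 1] - 2 * signal[i] + signal[i + 1]
            for i in range(1, len(signal) - 1)]

def zero_crossings(signal):
    """Return (position, sign) for each sign change of the second difference.

    position is in sample units, midway between the two samples that
    straddle the change; sign is +1 for a dark-to-light change in intensity
    and -1 for a light-to-dark one.
    """
    d2 = second_difference(signal)
    crossings = []
    for k in range(len(d2) - 1):
        if d2[k] * d2[k + 1] < 0:
            position = k + 1.5          # midpoint of samples k + 1 and k + 2
            sign = 1 if d2[k] > 0 else -1
            crossings.append((position, sign))
    return crossings

# Hypothetical intensity profile with one sharp dark-to-light edge.
intensity = [10, 10, 10, 10, 50, 50, 50, 50]
print(zero_crossings(intensity))
```

The sign attached to each crossing matters later, because only items of the same sign in the two images are candidates for matching.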
2 Current approaches to the matching problem

One  of  the  most  significant  advances  in  modern  psychophysics  was 
Julesz's  (1960)  invention  of  the  random-dot  stereogram.  In  addition  to 
its  various  applications,  this  invention  posed  one  of  the  few  clear 
problems  for  neurophysiology  and  psychology,  because  it  showed  that 
stereoscopic  fusion  is  a relatively  early  and  independent  computation. 

Julesz's  (1971)  subsequent  studies  have  helped  to  shape  the 
character  of  experimental  and  theoretical  approaches  to  this  problem. 
One  of  his  most  influential  suggestions  has  been  the  notion  that  the 
computation  of  stereo  disparity  depends  on  competing  excitatory  and 
inhibitory  influences  between  nearby  items  with  the  same  and  different 
disparities  (Julesz  1971,  page  220  last  paragraph).  This  suggestion 
arose  out  of  his  belief  that  binocular  fusion  is  a cooperative  process, 
a belief  whose  foundation  we  shall  examine  critically  below.  Apart 
from  AUTOMAP  (Julesz  1962),  all  the  models  we  now  examine  were  attempts 
at  realizing  this  idea. 

AUTOMAP (Julesz 1962) is a cluster-seeking program that operates in
various layers shown in figure 2. Of the two rules formulated in the
last section, it implements R2 (continuity) implicitly (because it
detects only clusters), but it fails to implement R1. Hence in an
ambiguous stereogram, both organizations will be detected
simultaneously.

The dipole model (Julesz 1971, page 203ff, and Julesz & Chang 1976).

Each  position  on  each  retina  is  associated  with  a magnetic  dipole, 
whose polarity is determined (in the case of random-dot stereograms) by
the  retinal  intensity  value.  Spring  coupling  between  the  tips  of 
adjacent  dipoles  implements  the  continuity  rule  R2.  The  orientation  of 
a dipole  represents  a disparity  value,  and  the  fact  that  each  dipole 
can  have  only  one  orientation  at  a time  provides  an  implementation  of 
the uniqueness rule R1. Notice that unlike the other models we shall
discuss,  this  one  does  not  represent  explicitly  all  possible  states  of 
figure  2,  since  each  horizontal  or  vertical  line  in  that  figure 
corresponds  to  the  angular  range  in  position  of  a single  dipole.  Hence 
taken  literally,  this  model  would  correspond  to  a scheme  in  which 
disparity  at  each  position  is  signalled  by  the  rate  of  firing  of  a 
single  neuron  and  can  therefore  be  thought  of  as  a one-pool  model.  It 
would  be  interesting  to  see  a computer  implementation  of  such  a model. 

Sperling  (1970).  This  model  is  based  on  correlation  between  two  grey- 
level  images  (his  eq.  [7]  p. 471,  and  see  also  p.483  lines  9-12).  Its 
approach is unsatisfactory for two reasons: firstly, as we have already
seen,  grey-levels  are  an  inappropriate  domain  for  the  matching 
function,  and  secondly  the  area  and  disposition  of  the  neighbourhoods 
over  which  the  correlation  is  taken  is  crucial  and  left  unspecified. 
Sperling’s  work  does  however  make  an  interesting  point  of  the  connexion 
between  stereopsis  and  vergence  movements. 




Figure 2. The explicit structure of the two rules R1 and R2 for the case of a
one-dimensional image is represented in (a), which also shows the
structure of a network for implementing the algorithm described by eq.
1. Lx and Rx represent the positions of descriptive elements in the
left  and  right  images.  The  continuous  vertical  and  horizontal  lines 
represent  lines  of  sight  from  the  left  and  the  right  eyes.  Their 
intersection  points  correspond  to  possible  disparity  values.  R1  states 
that  only  one  match  is  allowed  along  any  given  horizontal  or  vertical 
line;  R2  states  that  solution  planes  tend  to  spread  along  the  dotted 
diagonal  lines,  which  are  lines  of  constant  disparity. 

In  a network  implementation  of  these  rules,  one  can  place  a "cell" 
at  each  node;  then  solid  lines  represent  "inhibitory"  interactions,  and 
dotted  lines  represent  "excitatory"  ones.  The  local  structure  at  each 
node  of  the  network  in  (a)  is  given  in  (b).  This  algorithm  may  be 
extended  to  two-dimensional  images,  in  which  case  each  node  in  the 
corresponding  network  has  the  local  structure  shown  in  (c).  (From  Marr 
& Poggio 1976, fig. 2).

Nelson (1975) dilated on the ideas of Julesz, making two proposals
which  could  implement  our  two  rules.  In  terms  of  figure  2,  the 
geometry  of  the  lines  of  inhibition  is  left  unclear,  but  it  probably 
corresponds  more  to  the  inhibition  shown  in  figure  3 than  that  of 
figure 2, and so does not precisely implement rule R1. Nelson gave no
precise  algorithm,  nor  did  he  implement  any  form  of  his  ideas. 

Dev  (1975)  was  one  of  the  first  to  formulate  a precise  algorithm  that 
attempted  to  embody  Julesz's  ideas  [Dev  1975,  eq.  (1)  & (2),  p.  515]. 

In terms of our rules, the algorithm realizes R2, but an incorrect
version of R1 (see figure 3). Dev's algorithm is not cooperative,
however, because it is linear (her equations 1 and 2). Dev writes (p.
526 lines 18-19) of applying a threshold to the results of the linear
operation, but she did not say how to set such a threshold
appropriately, and this is, of course, not a trivial problem¹.

Hirai & Fukushima (1976) constructed a neural model that correctly
implemented the uniqueness rule R1 [their function (1), p. 481], but did
not  implement  rule  R2,  preferring  instead  a network  that  favoured 
solutions  with  lower  parallax.  This  is  an  interesting  idea,  and  a form 
of  it  plays  a role  in  our  theory  (cf.  figure  9). 

Sugie & Suwa (1977) proposed a new and complex (non-linear and
iterative)  model  that  implements  part  of  rule  R2,  but  apparently  uses 


Figure 3. The uniqueness rule R1 gives rise to two sets of inhibitory
interactions, along the lines of sight from each eye, as illustrated in
figure 2. Several of the algorithms that are described in the text use
inhibitory connexions like those illustrated here. Roughly speaking,
these algorithms search for solutions that can be regarded as a single-
valued, continuous vector field in a plane.

the incorrect (figure 3) version of rule R1. The presence of an AND
gate  on  their  BN4  neurons  (their  fig.  4b)  prevents  their  model  from 
exhibiting  "filling-in"  phenomena,  and  calls  into  question  the  exact 
nature  of  the  rule  R2  that  their  network  realizes. 

Marr & Poggio (1976) formulated the iterative algorithm

\[
C^{(t+1)}_{x,y;d} = \sigma\left\{ \sum_{x',y',d' \in S(x,y,d)} C^{(t)}_{x',y';d'} \;-\; \epsilon \sum_{x',y',d' \in O(x,y,d)} C^{(t)}_{x',y';d'} \;+\; C^{(0)}_{x,y;d} \right\} \qquad (1)
\]


where C^{(t)}_{x,y;d} denotes the state of the cell corresponding to position
(x,y), disparity d and time t in the network of figure 2; S(x,y,d) is a
local excitatory neighborhood confined to the same disparity layer, and
O(x,y,d), the inhibitory neighborhood, consists of cells lying on the
two lines of sight (figure 2c). ε is an inhibition constant, and σ is
a threshold function. The initial state C^{(0)} contains all possible
matches,  including  false  targets,  within  the  prescribed  disparity 
range. The rules R1 and R2 are implemented through the geometry of the
inhibitory and excitatory neighborhoods O and S (figure 2c). This
algorithm was shown to solve random-dot stereograms successfully (Marr
& Poggio 1976, figures 3 ff.). In a mathematical analysis of the
algorithm (Marr, Palm & Poggio 1978), it was demonstrated that states
obeying the two rules R1 and R2 were stable states of the algorithm,
and it was shown that, for a wide range of parameter values, the
algorithm converges.

The  nature  of  the  input  to  which  the  algorithm  is  applied  was  not 
specified  in  detail.  The  success  of  the  algorithm  was  demonstrated 
only  for  random-dot  stereograms,  where  this  problem  does  not  arise.  In 
addition,  the  algorithm  has  no  dynamics  in  this  form,  and  therefore 
exhibits  no  hysteresis. 
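
The iteration can be sketched for a one-dimensional image as follows. This is a toy under stated assumptions, not the published implementation: the neighborhood radius, the values of ε (`eps`) and of the threshold (`theta`), the use of signed tokens that match only when their signs agree, and the hand-made input pattern are all chosen here for illustration. Disparity is taken as right position minus left position, so the two lines of sight through a cell are the cells sharing its left position and the cells sharing its right position.

```python
def initial_state(left, right, disparities):
    """All candidate matches: same non-zero token at x (left) and x + d (right)."""
    n = len(left)
    return {(x, d) for x in range(n) for d in disparities
            if 0 <= x + d < n and left[x] != 0 and left[x] == right[x + d]}

def step(state, initial, n, disparities, radius=2, eps=2.0, theta=2.0):
    """One update: excitation within a layer, inhibition along lines of sight."""
    new_state = set()
    for x in range(n):
        for d in disparities:
            if not 0 <= x + d < n:
                continue
            # S: neighbours in the same disparity layer.
            excite = sum((x2, d) in state
                         for x2 in range(x - radius, x + radius + 1) if x2 != x)
            # O: other disparities on the left-eye line of sight (same x) ...
            inhibit = sum((x, d2) in state for d2 in disparities if d2 != d)
            # ... and on the right-eye line of sight (same x + d).
            inhibit += sum((x + d - d2, d2) in state
                           for d2 in disparities if d2 != d)
            total = excite - eps * inhibit + ((x, d) in initial)
            if total >= theta:          # sigma: a simple threshold
                new_state.add((x, d))
    return new_state

def run(left, right, disparities, iterations=5):
    initial = initial_state(left, right, disparities)
    state = set(initial)
    for _ in range(iterations):
        state = step(state, initial, len(left), disparities)
    return state

# Hypothetical signed tokens; the right view is the left view shifted by one.
left = [1, 1, -1, -1, 1, 1, -1, -1, 1, 1]
right = [0] + left[:-1]
disparities = (-1, 0, 1)
final = run(left, right, disparities)
print(sorted(final))
```

For this input the initial state holds nine correct matches in the d = 1 layer and five false targets in the d = 0 layer. With the illustrative parameter values the false targets are suppressed on the first update and the surviving cells all lie in the d = 1 layer, a state that a further update leaves unchanged.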
3 Common characteristics of these algorithms

Apart  from  AUTOMAP  and  Sperling  (1970),  all  of  these  algorithms 
are  based  on  Julesz's  proposal  that  fusion  in  human  stereopsis  is  a 
cooperative  process.  An  essential  feature  of  these  algorithms  is  that 
they  are  designed  to  select  correct  matches  in  a situation  where  false 
targets  occur  in  profusion.  That  is,  apart  possibly  from  early 
versions  of  Julesz's  dipole  model,  they  do  not  critically  rely  on  eye 
movements,  since  in  principle,  they  have  the  ability  to  interpret  a 
random-dot  stereogram  without  them. 

At  the  level  of  neurophysiology,  these  algorithms  (with  the 
exception  of  Julesz's  dipole  model,  which  we  discussed  above)  all 
require many disparity "layers". This would imply (i) the existence of
many  "disparity-detecting"  neurons,  whose  peak  sensitivities  cover  a 
range  of  disparity  values  that  is  much  wider  than  the  tuning  curves  of 
the  individual  neurons,  and  which  are  rather  insensitive  to  the  nature 
of  the  descriptive  element  (e.g.  edge,  termination)  to  which  they  may 
refer;  (ii)  organization  of  these  units  into  disparity  layers  (or 
stripes  or  columns);  (iii)  the  presence  of  reciprocal  excitation  within 
each  layer;  and  (iv)  the  presence  of  reciprocal  inhibition  between 
layers. For Marr & Poggio's algorithm, the inhibition should exhibit
the  characteristic  "orthogonal"  geometry  of  the  thick  lines  in  figure  2 
(the  lines  of  sight). 

We turn now to an examination of the available empirical evidence.

4 Evidence from neurophysiology and psychophysics

4.1 Neurophysiology

The  questions  of  interest  to  us  can  be  formulated  clearly,  as 
follows:  (i)  Are  there  disparity  detectors?  (ii)  If  so,  how  finely 
tuned  are  they,  and  what  range  of  disparities  is  covered  by  their  peak 
sensitivities? For example, are there many, or are there just two or
three (crossed, uncrossed and possibly zero disparity)? (iii) Are they
organized into layers or columns of equal disparities? What are their
excitatory or inhibitory relationships to one another? (iv) Are the
disparity detectors sensitive to specific spatial features (e.g.
oriented  edges,  oriented  bars,  or  terminations)? 

Let  us  now  examine  the  evidence  that  is  available  about  these  four 
points. 

(i)  Although  most  physiologists  believe  that  disparity  detectors  do 
exist,  there  is  apparently  some  disagreement  about  the  cortical  area 
involved. Barlow, Blakemore & Pettigrew (1967) originally reported the
existence of disparity sensitive units in the primary visual cortex of
the cat, a finding substantiated by several subsequent articles, for
example, Pettigrew, Nikara & Bishop (1968) and Nelson, Kato & Bishop
(1977). Hubel & Wiesel (1970) failed to find depth-sensitive neurons
in area 17 of the macaque monkey, though they did find them in area 18.
They later stated that this situation is also true in the cat (Hubel &
Wiesel 1973). Recently, Poggio & Fischer (1977) reported depth
sensitive neurones in areas 17 and 18 of the alert macaque monkey.

They were unable to offer an explanation for the difference between
their results and Hubel & Wiesel's, apart from the difference in the
state of the animals. In studies on the cat, Hubel & Wiesel usually
used  barbiturates,  whereas  all  the  other  investigations  were  carried 
out  under  nitrous  oxide. 

(ii)  Barlow  et  al.  (1967  figure  3)  reported  disparity  sensitive  cells 
in the cat that had a range of about 6.3 degrees at 5 - 15 degrees
eccentricity. Pettigrew et al. (1968 figure 11) described cells at an
eccentricity of 8 degrees tuned to a disparity of about 3 degrees (see
also figure 9 of Nikara, Bishop & Pettigrew 1968). In the monkey,
Poggio & Fischer (1977) found a range of optimal disparity sensitivity
of for example ±0.3 degrees at 1 degree eccentricity (see their figure
8). 

Little  is  certain  about  sharpness  of  disparity  tuning.  In  the 
cat,  figure  10  of  Nelson  et  al.  (1977)  exhibits  the  response  of  a 
disparity-sensitive cell that is tuned to an unknown disparity value,
with an accuracy of ±0.5 degrees. Bishop, Henry & Smith (1971 figure
6c) described a cell that was about twice as finely tuned. In the
monkey, Poggio & Fischer (1977) found four types of depth-sensitive
cell in areas 17 and 18: (a) cells excited by and narrowly tuned to
stimuli at the depth of the plane of fixation; (b) cells whose response
was  essentially  the  complement  of  (a)  (also  described  in  the  cat  by 
Pettigrew  et  al.  1968  p.  406);  (c)  near  neurons,  that  were  stimulated  by 
stimuli  in  front  of  the  fixation  plane  and  were  suppressed  by  those 
behind  it;  and  (d)  far  neurons,  the  opposite  of  near  ones.  Some  of  the 
class  (a)  cells  had  a disparity  tuning  as  sharp  as  3’,  whereas  some  of 
the  class  (b)  cells  exhibited  a total  range  of  binocular  interaction 
that  could  extend  to  more  than  ±1  degree  of  disparity. 

There  are  hints  of  a monotonic  relationship  between  eccentricity 
and  optimal  disparity  (Poggio  6 Fischer  1977  figure  8),  and  between 
receptive  field  size  and  the  sharpness  of  disparity  tuning  (Pettigrew 
et al. 1968, figure 11).

(iii) Hubel & Wiesel (1970) in the macaque remarked that cells
representing a given stereoscopic depth relative to the surface of
fixation are grouped together, possibly into columns (p. 41).

There  is  no  evidence  about  the  physiological  connections  among 
these  cells.  Almost  all  disparity-sensitive  cells  have,  however,  been 
reported  to  have  inhibitory  flanks  for  disparities  lying  outside  the 
tuning  range. 


(iv)  Little  is  known  about  the  spatial  features  to  which  disparity 
neurons are sensitive, for instance, whether they have "bar-shaped" or
"edge-shaped" receptive fields. Hubel & Wiesel (1970) imply that in
the monkey, most binocular depth cells have vertically oriented,
elongated receptive fields. Nelson et al. (1977) found that binocular
neurons  in  the  striate  cortex  of  the  cat  are  rather  insensitive  to 
differences  in  the  orientations  of  slits  or  bars  in  the  two  eyes.  They 
concluded  that  binocular  units  do  not  detect  tilt  directly. 


Comments 

Up  to  now,  the  questions  we  set  out  have  not  received  direct 
attention,  and  the  answers  are  at  best  uncertain,  partly  because  of 
contradictory  conclusions  from  different  laboratories.  Nevertheless, 
it  is  not  clear  why,  for  example,  disparity  tuning  curves  were  not 
measured  quite  soon  after  Barlow  et  al.’s  (1967)  original  work. 
Apparently, only Poggio & Fischer (1977) have provided careful evidence
on this point, and it is unfortunate that their findings contradict the
opinions implied by previous workers.

An  important  factor  contributing  to  this  state  of  affairs  has  been 
the  considerable  technical  difficulty  involved  in  experiments  on 
stereoscopic  disparity,  for  example,  the  precise  control  of  eye 
position and stimulus eccentricity. The apparent need to use moving
stimuli and slits that are more than a few minutes of arc wide makes
somewhat uncertain the interpretation of even those measurements that
can be obtained. We believe that flashed stimuli, of the type used in
psychophysics to avoid eye movements, may help in studies of these
cells, despite the extra difficulties they introduce.

Taken altogether, the physiological evidence is not compelling
about even the basic question ((ii) above) of whether there are two
pools of disparity detectors, or several. The recent work of Poggio &
Fischer seems to support Richards' (1970, 1971) two pools idea.


4.2  Psychophysics 

It  is  impossible  to  give  a brief  review  of  the  entire 
psychophysics  of  stereo  vision,  and  the  interested  reader  may  turn  to 
the  book  by  Julesz  (1971)  or  the  review  by  Richards  (1975),  which  also 
contain  extensive  bibliographies.  In  this  article,  we  restrict  our 
attention  to  points  that  we  regard  as  important  for  our  analysis  of  the 
computational  structure  of  the  stereo  disparity  problem.  We  divide  our 
survey  into  four  sections,  each  dealing  with  a different  aspect  of  the 
problem. 

The relevance of eye-movements.

As  we  stated  earlier,  one  of  the  critical  preliminary  questions 
about  the  information  processing  structure  of  human  stereopsis  concerns 
the  relative  importance  of  neural  fusion  and  of  eye  movements. 
Unfortunately, almost without exception (Fender & Julesz 1967, Evans &
Clegg 1967, Richards 1977) all studies using random-dot stereograms
proceeded by viewing the pairs with free eye movements (Julesz 1971)
even though the smallness of Panum's fusional area (Fender & Julesz
1967) suggested that eye movements must be important.

Although some observers can see depth in simple random-dot
stereograms  that  are  presented  in  a flash  or  under  stabilized  image 
conditions,  eye-movements  (or  the  associated  retinal  motion)  are 
essential  for  many  observers  for  simple  stereograms,  and  even  then  the 
perceived  depth  may  be  ambiguous  or  inappropriate  (Richards  1977), 
except possibly in the disparity range 0-13' (Mayhew & Frisby 1978).
For complex stereograms such as Julesz's spiral (1971 fig. 4.5-4), eye-
movements are probably essential (Frisby & Clatworthy 1975, Saye &
Frisby 1975).

Two  pools  or  many  disparity  layers? 

Richards (1970, 1971) and Richards & Regan (1973) proposed that
the mechanisms underlying stereoscopic depth perception are organized
into at least two pools, roughly corresponding to crossed and uncrossed
disparities. Richards based this proposal on a study of
"stereoanomalous" observers, who are able to process one of these kinds
of disparities more strongly than the other.

Although these data do not rule out the existence of many
functional layers, they suggest the genetic importance of the two pools
idea, and this itself hints at its functional implications.

Disparity detectors and spatial features


In  the  monocular  situation,  the  visibility  of  a one-dimensional 
sinusoidal  grating  remains  unchanged  in  the  presence  of  masking  noise 
filtered  so  as  to  contain  no  spectral  components  nearer  than  two 
octaves to the spatial frequency of the grating (Stromeyer & Julesz
1972). The equivalent finding holds also for two-dimensional patterns
(Harmon & Julesz 1973). Kaufman (1964) and Julesz (1971, 3.9 & 3.10)
found that one can simultaneously experience both binocular rivalry and
fusion of different spectral components in a stereogram. Julesz &
Miller (1975) recently put this finding on a quantitative basis. They
selected  masking  noise  bands,  containing  equally  effective  noise 
energy,  such  that  their  bands  either  overlapped  the  stereoscopic  image 
spectrum  or  were  two  octaves  distant.  The  first  case  resulted  in 
rivalry,  but  in  the  second,  stereoscopic  fusion  (and  the  consequent 
perception  of  depth)  could  be  maintained  despite  the  presence  of  strong 
binocular  rivalry  caused  by  the  masking  noise. 

This raises the possibility that disparity information might at
some stage be conveyed by independent stereopsis channels, tuned to
different spatial frequencies, and roughly one octave wide. Mayhew &
Frisby's  (1976)  interesting  results  using  rivalrous  texture  stereograms 
are  also  consistent  with  this  idea. 

The  available  psychophysical  evidence  about  the  orientation 
sensitivity of these channels suggests that it is poor (Julesz 1971 p.
89), which is consistent with the neurophysiological findings we
reviewed  earlier  (Nelson  et  al.  1977). 

The channels found by Julesz & Miller are probably the same as
those analyzed by several investigators (e.g. Campbell & Robson 1968).
In  spatial  terms,  such  channels  probably  correspond  to  receptive  fields 
that  are  bar-shaped  rather  than  edge-shaped  or  more  like  gratings  (see 
figure  5b  in  the  next  section). 

Interestingly, it appears that line terminations can also be
matched (Julesz 1971 p. 80, see also p. 92; Frisby & Julesz 1975).
This raises the question of whether the matching of terminations relies
on their explicit extraction from an image, or whether it is an
epiphenomenon attributable to the existence of narrowly tuned frequency
channels [cf. Cowan's (1977) discussion of Shapley & Tolhurst's (1973)
results].

There is independent evidence that disparity mechanisms make
bar-by-bar correlations as opposed to edge-by-edge correlations (Felton,
Richards & Smith 1972). Using a modification of the Blakemore &
Campbell (1969) adaptation technique, Felton et al. presented
high-contrast sine-wave gratings binocularly at and off the plane of
fixation.  Under  these  conditions,  the  greatest  rise  in  threshold 
following  adaptation  occurs  for  test  gratings  presented  in  the  same 
plane  as  the  adapting  grating.  They  found  that  the  adapted  mechanisms 
have  narrow  rather  than  broad  spatial-frequency  tuning  curves. 


Another very important question that these results naturally
raise is whether these independent spatial channels are also separated
in the disparity domain, that is, whether they are further
distinguished by the range of disparity values they can convey.
Felton, Richards & Smith (1972) again provided evidence on this point,
concluding  that  over  the  1.0  degree  disparity  range  they  examined, 
narrow  bar  detectors  feed  small  disparity  mechanisms  whereas  wide  bar 
detectors  feed  large  disparity  mechanisms.  We  shall  propose  later  that 
wide  bar  detectors  can  in  fact  detect  small  disparities,  but  with 
poorer  resolution  than  small  bar  detectors. 

We feel that indirect evidence about this point may come from the
differing reports of the size of Panum's fusional area. Fender &
Julesz (1967) gave a figure of 6' for Panum's fusional area for random-
dot stereograms with a dot size of about 2'. In subsequent experiments,
however, Julesz & Chang (1976) routinely flashed stereograms with a
range of disparities, implying that those up to ±18' were fused. An
attractive explanation for this discrepancy is that Panum's area
depends on the dot size, since Julesz & Chang's dots were 6' square. This
hypothesis is clearly in the same spirit as the conclusion of Felton,
Richards & Smith.


Hysteresis  and  cooperativity 

In a seminal paper², Fender & Julesz (1967) demonstrated the
existence of hysteresis in stereopsis. They studied the fusion of
binocularly stabilized random-dot stereoscopic images, and found that
once fused (in the 6' Panum area), images could be pulled apart
symmetrically by about 2 degrees in the horizontal direction without
loss of stereopsis or fusion. This finding provided the basic reason
for a widespread belief that binocular fusion is a cooperative process.
Later, Julesz adduced in its further support the phenomena of (i)
disorder-order transitions and multiple stable states in stereopsis
(Julesz & Chang 1976 p. 117), (ii) the pulling effect with ambiguous
random-dot stereograms (Julesz & Chang 1976), and (iii) Julesz's
conclusion (1971 p. 200) that "stereopsis is a parallel process in
which each depth plane is simultaneously processed." The idea that
stereopsis is cooperative formed the starting point for all attempts at
constructing neural models for this computation (cf. section 1).

In their original paper, Fender & Julesz concluded that the
labelling of corresponding points can occur only within Panum's
fusional region, but that under appropriate conditions, these labels
could then be preserved for large retinal image shifts. Their original
data, however, only indicated the presence of a "simple memory process"
which they then chose (p. 829) to call hysteresis. There is no
evidence that this hysteresis is intrinsic to the labelling process
itself, an hypothesis which is essential if stereopsis is to be
regarded as a simple cooperative phenomenon. On the contrary, the
phenomenon of hysteresis appears over a disparity range of 2 degrees,
which is much greater than even the largest estimates of Panum's
fusional area. Furthermore, Fender & Julesz (p. 829) actually "suggest
the existence of three different processes in stereopsis. A labelling
process, which is operative in Panum's fusional region, establishes
correlation  between  corresponding  areas  in  the  left  and  right  images 
having  various  disparities.  A cortical-registration  process  preserves 
these  labels  even  if  the  left  and  right  images  are  pulled  apart  on  the 
retinas...  [and]  convergence  motion  of  the  eyes,  which  compensate  for 
large  or  rapid  errors  of  disparity." 

It  seems  to  us  that,  while  the  notion  of  hysteresis  can  certainly 
be  applied  to  the  registration  process,  because  it  is  essentially  a 
memory,  there  is  no  direct  evidence  for  cooperativity  in  the  labelling 
process  itself. 


4.3  Conclusions 

Many  of  these  findings  cast  doubt  on  the  relevance  of  cooperative 
algorithms  to  the  question  of  the  fusion  process  in  human  stereo 
vision.  The  principal  points  are  (a)  the  apparently  crucial  role 
played  by  eye-movements  in  human  stereo  vision,  (b)  the  ability  of  some 
subjects  to  tolerate  a 15%  expansion  of  one  image  (Julesz  1971  figure 
2.8-8),  (c)  the  findings  about  independent  spatial-frequency-tuned 
channels  in  binocular  fusion,  of  which  our  tolerance  to  severe 
defocussing  of  one  image  is  a striking  demonstration  (Julesz  1971 
figure  3.10-3),  (d)  the  physiological,  clinical  and  psychophysical 
evidence  about  Richards'  three-pools  hypothesis,  and  (e)  the  size  of 
Panum's fusional area (6' - 18') which seems surprisingly small to have
to resort to cooperative mechanisms of neural fusion for the
elimination of false targets.

Finally,  we  may  mention  that  none  of  the  theories  based  on 
cooperativity  gives  a clear  indication  of  the  nature  of  the  spatial 
features that should be matched.


5 A theory of stereopsis

Taken together, these findings indicate that an approach of a
quite different kind to the problem is probably necessary. In this
section, we present an alternative theory, first giving a rough
outline together with the ideas that led us to it, and then formulating
the theory in detail.


5.1  An  outline  of  the  theory 

The basic computational problem in binocular fusion is the
elimination of false targets, and the difficulty of this problem is in
direct proportion to the range and resolution of the disparities that
are considered. The problem can therefore be simplified by reducing
either the range, or the resolution, or both, of the disparity
measurements that are taken from two images. An extreme example of the
first strategy would lead to a diagram like figure 2 in which only
three adjacent disparity planes were present (e.g. +1, 0, -1), each
specifying its degree of disparity rather precisely. The second
strategy, on the other hand, would amount to maintaining the range of
disparities shown in figure 2, but reducing the resolution with which
they are represented. In the extreme case, only three disparity values
would be represented: crossed, roughly zero, and uncrossed.

These schemes, based on just three pools of disparity values,
substantially eliminate the false targets problem at the cost on the
one hand of a very small disparity range, and on the other, of poor
disparity resolution. Thus the price of computational simplicity is a
trade-off between range and resolution.
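
The dependence of the false-target problem on disparity range can be illustrated numerically. The following Python sketch is not part of the original analysis. It assumes a one-dimensional row of identical point features scattered uniformly at random, a single true disparity, and hypothetical function names. For each left-image feature it counts how many right-image features other than the correct one fall within the permitted disparity range.

```python
import random


def count_candidate_matches(left, right, max_disparity):
    """Count (left, right) pairs whose separation is within +/- max_disparity."""
    total = 0
    for x in left:
        total += sum(1 for y in right if abs(y - x) <= max_disparity)
    return total


def false_targets_per_feature(n_features, width, true_disparity,
                              max_disparity, seed=0):
    """Average number of incorrect candidates per left-image feature.

    Assumes max_disparity exceeds |true_disparity|, so that every correct
    match is among the candidates and can be subtracted out.
    """
    rng = random.Random(seed)
    left = [rng.uniform(0.0, width) for _ in range(n_features)]
    right = [x + true_disparity for x in left]  # same features, shifted
    candidates = count_candidate_matches(left, right, max_disparity)
    return (candidates - n_features) / n_features


if __name__ == "__main__":
    for max_disparity in (2.0, 5.0, 10.0, 20.0):
        rate = false_targets_per_feature(200, 1000.0, 0.5, max_disparity)
        print(max_disparity, round(rate, 2))
```

Under these assumptions the number of false targets per feature grows roughly in proportion to the disparity range searched, which is the sense in which narrowing the range simplifies the computation.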

One  would,  however,  expect  the  human  visual  system  to  possess  both 
range  and  resolution  in  its  disparity  processing.  In  this  connection, 
the  existence  of  independent  spatial-frequency-tuned  channels  in 
binocular  fusion  is  of  especial  interest,  because  it  suggests  that 
several  copies  of  the  image,  obtained  by  successively  finer  filtering, 
are  used  during  fusion,  providing  increasing  and,  in  the  limit,  very 
fine  disparity  resolution  at  the  cost  of  decreasing  disparity  range. 

A notable feature of a system organized along these lines is its
reliance on eye-movements for building up a comprehensive and accurate
disparity map from two viewpoints. The reason for this is that the
most precise disparity values are obtainable from the high-resolution
channels, and eye-movements are therefore essential so that each part
of a scene can ultimately be brought into the small disparity range
within which high resolution channels operate. The importance of
vergence eye-movements is especially attractive in view of the recent
evidence about their role in human stereopsis (see section 4.2 above),
and the extremely high degree of precision with which they may be
controlled (Riggs & Niehl 1960, Rashbass & Westheimer 1961a).

These observations suggest a scheme for solving the fusion problem
in the following way: (1) Each image is analyzed through channels of
various coarsenesses, and matching takes place between corresponding
channels from the two eyes for disparity values of the order of the
channel resolution. (2) Coarse channels control vergence movements,
thus causing finer channels to come into correspondence.
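
Steps (1) and (2) can be sketched as a toy coarse-to-fine loop. This is an illustration only, under assumptions that are not in the text: each channel is taken to measure residual disparities of magnitude up to its width w, quantised to w/4, and vergence is taken to shift by exactly the measured amount. The function name and the channel widths in the usage line are hypothetical.

```python
def coarse_to_fine_vergence(true_disparity, channel_widths):
    """Drive vergence with coarse channels so that finer ones can match.

    true_disparity and channel_widths are in the same units (e.g. minutes
    of arc). Returns the final vergence and a per-channel history of
    (width, measured_residual_or_None, vergence_after).
    """
    vergence = 0.0
    history = []
    for w in sorted(channel_widths, reverse=True):  # coarsest channel first
        residual = true_disparity - vergence
        if abs(residual) > w:
            # Residual lies outside this channel's matching range: no match.
            history.append((w, None, vergence))
            continue
        step = w / 4.0  # assumed disparity resolution of this channel
        measured = round(residual / step) * step
        vergence += measured  # vergence movement reduces the residual
        history.append((w, measured, vergence))
    return vergence, history


if __name__ == "__main__":
    final, steps = coarse_to_fine_vergence(17.0, [3, 6, 12, 24])
    for width, measured, vergence in steps:
        print(width, measured, vergence)
```

In this sketch a disparity larger than the widest channel is never matched, while a disparity inside the coarsest range is refined by each successively finer channel.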

This  scheme  raises  a puzzle.  Since  it  contains  no  hysteresis,  it 
provides  no  explanation  for  the  basic  findings  that  led  Julesz  to 
conclude  that  binocular  fusion  is  a cooperative  process.  Recent  work 
in  the  theory  of  intermediate  visual  information  processing  argues  on 
computational  grounds  that  a key  goal  of  early  visual  processing  is  the 
construction of something like a "depth map" of the visible surfaces
round a viewer (Marr & Nishihara 1977 figure 2, Marr 1977 section 3).
The  motivation  for  this  proposal  is  that  a description  of  objects' 
shapes  has  to  be  derived  via  a description  of  their  visible  surfaces, 
and  information  about  these  is  obtainable  by  a number  of  different  and 
probably  independent  processes,  which  extract  disparity,  motion, 
shading,  texture  gradient  and  contour  information.  These  different 
types of information need to be combined in a buffer somewhere. One
proposal for carrying this out is the construction of a representation
that makes explicit the depth and orientation of visible surface
elements, and contours of surface discontinuity, in a coordinate frame
that is centered on the viewer (Marr 1977 Table 2). Marr & Nishihara
called this representation the 2½-D sketch (see figure 4).

The important point here is that the 2½-D sketch is in some sense
a memory, and it is this idea, together with the remarks of Fender &
Julesz that we quoted above, that offers a possible solution to our
puzzle. Suppose that the hysteresis Fender & Julesz observed is not
due to a cooperative process during fusion, but is in fact the result
of using a memory buffer in which to store the depth map of the image
as it is discovered. Then, the fusion process itself need not be
cooperative (even if it still could be), and in fact it would not even
be necessary for the whole image ever to be fused simultaneously,
provided that a depth map of the viewed surface were built and
maintained in this intermediate memory.

Figure 4. The 2½-D sketch represents depth, surface orientation and contours
of discontinuities in these quantities. A convenient representation of
surface orientation is illustrated in (a). The orientation of the
needles is determined by the projection of the surface normal on the
image plane, and the length of the needles represents the dip out of
that plane. A typical 2½-D sketch appears in (b), although depth
information is not represented in the figure.

Our scheme can now be completed by adding to it the following two
steps: (3) when a correspondence is achieved, it is held and written
down somewhere (e.g., in the 2½-D sketch); (4) there is a backwards
relation  between  the  memory  and  the  masks,  perhaps  simply  through  the 
control  of  eye-movements,  that  allows  one  to  fuse  any  piece  of  a 
surface  easily  once  its  depth  map  has  been  established  in  the  memory. 

We  turn  now  to  a more  detailed  analysis  of  these  ideas. 

5.2  The  nature  of  the  channels 

The articles by Julesz & Miller (1975) and Mayhew & Frisby (1976)
establish that spatial-frequency-tuned channels are used in stereopsis
and are independent. Julesz & Miller's findings imply that two octaves
is an upper bound for the bandwidth of these channels, and suggest that
they are the same channels as those previously found in monocular
studies (Campbell & Robson 1968, Blakemore & Campbell 1969). Although
strictly speaking it has not been demonstrated that these two kinds of
channel  are  the  same,  we  shall  make  the  assumption  that  they  are.  This 
will  allow  us  to  use  the  numerical  information  available  from  monocular 
studies  to  derive  quantitative  estimates  of  some  of  the  parameters 
involved  in  our  theory. 

The  idea  that  there  may  be  a range  of  different  size  or  spatial 
frequency  tuned  mechanisms  was  originally  introduced  on  the  basis  of 
psychophysical evidence by Campbell & Robson (1968). This led to a
virtual explosion of papers dealing with spatial frequency analysis in
the visual system. Recently, Wilson & Gieze (1977) and Cowan (1977)
integrated  these  and  other  anatomical  and  physiological  data  into  a 
coherent  logical  framework.  The  key  to  their  framework  is  (a)  the 
partitioning  of  the  range  of  sizes  associated  with  the  channels  into 
two  components,  one  due  to  spatial  inhomogeneity  of  the  retina,  and  one 
due  to  local  scatter  of  receptive  field  sizes;  (b)  the  correlation  of 
these  two  components  with  anatomical  and  physiological  data  about  the 
scatter  of  receptive  field  sizes  and  their  dependence  on  eccentricity. 

Figure 5. (a) Line spread functions measured at two different eccentricities
for HRW. The points are fitted using the difference of two Gaussian
functions with space constants in the ratio 1.5:1.0. The inhibitory
surround exactly balances the excitatory centre so that the area under
the curve is zero.
(b) Predictions of local spatial frequency sensitivity from
frequency gradient data and from line spread function data. The local
frequency sensitivity functions are plotted as solid lines. The dashed
lines are the local frequency response predicted by Fourier
transforming the line spread functions in (a), which were measured at
the appropriate eccentricities. The arrow in the lower graph indicates
a translation of the dashed curve by approximately 1.08 log10 units.
(Redrawn from Wilson & Gieze 1977 figs. 9 & 10).

On the basis of detection studies, they formulated an initial
model embodying the following conclusions: (1) at each position in the
visual field, there exist "bar-like" masks (see figure 5a), whose
tuning curves have the form of figure 5b, and which have a half-power
bandwidth of about an octave. (2) The bandwidth of the local
sensitivity function at each eccentricity is about three octaves.
Hence the range of receptive field sizes present at each eccentricity
is about 4:1. In other words, at least three and probably four
receptive field sizes are required at each point of the visual field.
(3) Average receptive field size increases linearly with eccentricity.
In humans at 0 degrees, the mean width w of the central excitatory
region of the mask is about 6' (range 3' to 12'); and at 4 degrees
eccentricity, w = 12' (range 6' to 24') (Wilson & Gieze figure 9,
Hines 1977 figures 2 & 3). If one assumes that this receptive field is
described by the difference of two gaussian functions with space
constants in the ratio 1:1.5, the peak frequency sensitivity of the
corresponding channel is given by 1/f = λ = 2.3w.
These figures agree quite well with physiological studies in the
Macaque. Hubel & Wiesel (1974) reported that the mean width of the
receptive field (s) increases linearly with eccentricity e (figure 6)
(approximately, s = 0.05e + 0.25 degrees, so that at e = 4 degrees, s =
27', which gives a value for w = s/3 of about 9' as opposed to 12' in
humans). The data of Schiller (1977 p. 1347 figures 12 & 14) are in
rough agreement with Hubel & Wiesel's. (4) Essentially all of the
psychophysical data on the detection of spatial patterns at contrast
threshold can be explained by (1), (2) and (3) together with the
hypothesis that the detection process is based on a form of spatial
probability summation in the channels.
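
The numbers quoted in item (3) can be checked with a short calculation. The sketch below is an editorial illustration with hypothetical function names. It assumes a balanced one-dimensional difference of two unit-area gaussians with space constants in the ratio 1:1.5, takes w to be the distance between the two zeros of that profile, and takes the human figures to lie on a straight line through 6' at 0 degrees and 12' at 4 degrees.

```python
import math


def hubel_wiesel_field_width_arcmin(ecc_deg):
    """Mean field width s = 0.05e + 0.25 degrees, converted to minutes of arc."""
    return 60.0 * (0.05 * ecc_deg + 0.25)


def human_centre_width_arcmin(ecc_deg):
    """Assumed linear fit through w = 6' at 0 degrees and w = 12' at 4 degrees."""
    return 6.0 + 1.5 * ecc_deg


def dog_width_and_peak_wavelength(sigma_e, ratio=1.5):
    """Central width w and peak wavelength of a balanced difference of gaussians."""
    sigma_i = ratio * sigma_e
    # Zero of the profile: exp(-x^2/2se^2)/se = exp(-x^2/2si^2)/si.
    x0 = math.sqrt(2.0 * math.log(ratio)
                   / (1.0 / sigma_e ** 2 - 1.0 / sigma_i ** 2))
    w = 2.0 * x0
    # Transfer function exp(-se^2 k^2/2) - exp(-si^2 k^2/2) peaks where its
    # derivative with respect to k vanishes.
    k_peak = math.sqrt(2.0 * math.log(sigma_i ** 2 / sigma_e ** 2)
                       / (sigma_i ** 2 - sigma_e ** 2))
    return w, 2.0 * math.pi / k_peak


if __name__ == "__main__":
    s = hubel_wiesel_field_width_arcmin(4.0)
    print("macaque s at 4 deg:", s, "arcmin; w = s/3 =", s / 3.0)
    print("human w at 4 deg:", human_centre_width_arcmin(4.0), "arcmin")
    w, lam = dog_width_and_peak_wavelength(1.0)
    print("lambda / w =", lam / w)
```

With these assumptions the ratio of peak wavelength to central width comes out close to the value 2.3 used in the text.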

With  the  characteristic  perverseness  of  the  natural  world,  this 
happy  and  concise  state  of  affairs  does  not  provide  a precise  account 
of  suprathreshold  conditions  (see  figure  7).  The  known  discrepancies 
can  however  be  explained  by  introducing  two  extra  hypotheses:  (5) 
Contrast sensitivities of the various channels are adjusted
appropriately to the stimulus contrast (Georgeson & Sullivan 1975).
The point of this is merely to ensure that bars of the same contrast
but different widths actually appear to have the same contrast; (6)
Receptive field properties change slightly with contrast, the
inhibition being somewhat decreased in low-contrast situations (Cowan
1977 p. 511).

Figure 6. Graph of average receptive field size (crosses) and magnification
(open circles) against eccentricity, for five cortical locations.
Points for 4, 8, 18 and 22 degrees were from one monkey; for 1 degree
from a second. Field size was determined by averaging the fields at
each eccentricity, estimating size from (length × width)^0.5. (Redrawn
from Hubel & Wiesel 1974 fig. 6a).

In a more recent article, Wilson & Bergen (1978) have found
that the situation at threshold may also be more complicated.

They proposed a model consisting of four size-tuned mechanisms
centred at each point, the smaller two showing relatively
sustained temporal responses, and the larger two being relatively
transient. As far as is known, this model accurately accounts
for all published threshold sensitivity studies.

The two sustained channels, which Wilson & Bergen call N and
S, have w values 3.1' and 6.2'; the transient channels, called T
and U, have w's of 11.7' and 21'. The sizes of these channels
increase with eccentricity in the same way as described above.

The S channel is the most sensitive under both transient and
sustained stimulation, and the U channel is the least, having
only 1/11 to 1/4 the sensitivity of the S channel. The extent to
which the U channel, for example, plays a role in stereopsis is
of course unknown.

In what follows, we shall assume that the figures given
earlier for the numbers and dimensions of receptive field centres
and their scatter hold roughly for suprathreshold conditions. If
future experiments confirm that these more recent numbers are
relevant for stereopsis, some modification of our quantitative
estimates may be necessary.

These figures allow us to estimate the minimum sampling density
required by each channel, i.e. the minimum spatial density of the
corresponding receptive fields. From figure 10 of Wilson & Gieze
(1977), a channel with peak sensitivity at wavelength λ is band-limited
on the high-frequency side by wavelengths of about 2λ/3. This figure
is for a threshold criterion of 15-30%, but is rather insensitive to
the exact value chosen. Hence by the Sampling Theorem (Papoulis 1968,
p. 119), the distance between samples (i.e. receptive fields),
in a direction perpendicular to their preferred orientation, must be at
most λ/3. Assuming the overall width of the receptive field is about 3λ/2,
the minimum number of samples per receptive field width is about 4.5.

An estimate of the minimum longitudinal sampling distance may be
obtained as follows. Assume that the receptive field's longitudinal
weighting function (see table 1) is gaussian with space-constant σ,
thus extending over an effective distance of say 4σ to 6σ. Its
fourier transform is also gaussian with space constant in the frequency
domain (ω) of 1/σ, and for practical purposes can be assumed to be
band-limited with f_max = 3/(2πσ) to 2/(2πσ). By the sampling theorem, the
corresponding sampling intervals are σ to 1.5σ, i.e. about 4
samples per longitudinal receptive field distance. Hence the minimum
number of measurements (i.e. cells or receptive fields) per receptive
field area is about 18. If one assumes that the density of sampled
image points is constant over the visual field, it follows that the
computational effort required to process the image through a given
channel is roughly independent of the receptive field size associated
with that channel³.
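
The sampling estimates above amount to a few lines of arithmetic, reproduced here as a hedged Python sketch. The function names are hypothetical. The only inputs are the figures quoted in the text: a high-frequency cut-off at wavelength 2λ/3, an overall field width of 3λ/2, a longitudinal extent of 4σ to 6σ, and a band limit of 2/σ to 3/σ in angular frequency.

```python
import math


def transverse_samples_per_field(wavelength=1.0):
    """Samples needed across the field, perpendicular to its preferred orientation."""
    cutoff_wavelength = 2.0 * wavelength / 3.0  # shortest wavelength passed
    max_spacing = cutoff_wavelength / 2.0       # sampling theorem: two samples per cycle
    field_width = 3.0 * wavelength / 2.0        # assumed overall field width
    return field_width / max_spacing


def longitudinal_samples_per_field(extent_in_sigma, cutoff_in_sigma):
    """Samples needed along the field for a gaussian weighting function.

    extent_in_sigma: effective length of the field in units of sigma.
    cutoff_in_sigma: assumed band limit, in units of 1/sigma (angular frequency).
    """
    f_max = cutoff_in_sigma / (2.0 * math.pi)   # cycles per sigma
    max_spacing = 1.0 / (2.0 * f_max)           # in units of sigma
    return extent_in_sigma / max_spacing


if __name__ == "__main__":
    across = transverse_samples_per_field()
    along = longitudinal_samples_per_field(6.0, 2.0)
    print("across:", across)                    # about 4.5
    print("along:", round(along, 2))            # roughly 4
    print("per field area:", across * round(along))
```

Because both counts are ratios of lengths that scale with the channel's wavelength, the result does not depend on the receptive field size, which is the point made in the last sentence of the paragraph above.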

This model of the preliminary processing of the image is

5.3 The domain of the matching function

In  view  of  this  information,  the  first  step  in  our  theory  consists 
of  filtering  the  left  and  right  spatial  images  through  four  bar  masks 
at  each  point  in  the  images.  We  assume  that  this  operation  is  roughly 
linear,  for  a given  intensity  and  contrast.  When  matching  the  left  and 
right  images,  one  cannot  simply  use  the  raw  values  measured  by  this 
first  stage,  because  they  do  not  correspond  directly  to  physical 
features  on  visible  surfaces  on  which  matching  may  be  based.  One  first 
has  to  obtain  from  these  measurements  some  symbol  that  corresponds  with 
high  probability  to  a physical  item  with  a well-defined  spatial 
position. This observation, which has been verified through computer
experiments in the case of stereo vision (Grimson & Marr 1978), formed
the  starting  point  for  a recent  approach  to  the  early  processing  of 
visual  information  (Marr  1974,  1976). 

Perhaps the simplest way of obtaining suitable symbols from an
image is to find signed peaks in the first (directional) derivative of
the intensity array, or alternatively, zero-crossings in the second
derivative. The bar-masks of table 1 measure an approximation to the
second directional derivative at roughly the resolution of the mask
size, so clear signed zero-crossings in the convolution values obtained
along a scan line lying perpendicular to the receptive field's
longitudinal axis (cf. Marr 1976 figure 2) would specify an appropriate
location precisely. The fact that the sign of the zero-crossings is
important is consistent with experimental data like that of Julesz
(1963 figure 2). Since for stereopsis, a precise estimate of only the
horizontal coordinate is required, in principle we need to consider
only masks having vertically oriented receptive fields. It is known,
however, that vertical disparity information is used to help align the
two eyes (see e.g. Ogle, Martens & Dyer 1967 chapter 11), and so in the
human visual system, masks with horizontally oriented receptive fields
may be used⁴.

Table 1 (fragment). At each point in the visual field the image is
filtered through receptive fields having these characteristics:
8. In each position there are four receptive field sizes, the smallest
being 1/4 of the largest. The profile R(x) and fourier transform

In practice, however, it is not enough to use just vertically
oriented masks to obtain horizontal disparity information. Julesz
(1971, p. 80) showed that minute breaks in horizontal lines can lead to
fusion of two stereograms even when the breaks lie close to the limit
of visual acuity. Such breaks cannot be obtained by simple operations
on the measurements from even the smallest vertical masks. These
breaks probably have to be localized by a specialized process for
finding terminations by examining the values and positions of rows of
zero-crossings obtained from horizontal mask convolutions (cf. Marr
1976 p. 496).

Thus not only zero-crossings but also terminations have to be made
explicit (cf. the principle of explicit naming, Marr 1976 p. 485).
The matching process will then operate on descriptions of the left and
right images that are built of these two symbolic primitives, and
which specify their positions, the mask size from which they were
obtained, and their signs. This process is summarized in Table 2.
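
The extraction of signed zero-crossings along a scan line can be sketched in one dimension. The Python below is a minimal illustration, not the procedure of Table 2: it assumes a bar mask whose cross-section is a balanced difference of two gaussians (ratio 1:1.5), applies it to a single row of intensities, and reports only "clear" zero-crossings, meaning those whose neighbouring values exceed a small threshold eps. The function names and the threshold are hypothetical.

```python
import math


def dog_kernel(sigma_e, ratio=1.5):
    """Balanced difference-of-gaussians cross-section, sampled at integer positions."""
    sigma_i = ratio * sigma_e
    half = int(math.ceil(4 * sigma_i))

    def gauss(x, s):
        return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    raw = [gauss(x, sigma_e) - gauss(x, sigma_i) for x in range(-half, half + 1)]
    mean = sum(raw) / len(raw)
    return [v - mean for v in raw]  # force the weights to sum to zero


def convolve_valid(signal, kernel):
    """Convolution restricted to positions where the kernel fits entirely."""
    n, m = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[m - 1 - j] for j in range(m))
            for i in range(n - m + 1)]


def signed_zero_crossings(values, offset=0, eps=1e-9):
    """Return (position, sign) pairs; sign is +1 for positive-sloped crossings."""
    crossings = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if (a < -eps and b > eps) or (a > eps and b < -eps):
            position = i + a / (a - b) + offset  # linear interpolation
            crossings.append((position, 1 if b > a else -1))
    return crossings


if __name__ == "__main__":
    row = [0.0] * 50 + [1.0] * 50 + [0.0] * 50  # a bright bar on a dark ground
    kernel = dog_kernel(3.0)
    response = convolve_valid(row, kernel)
    print(signed_zero_crossings(response, offset=len(kernel) // 2))
```

In this example the two edges of the bar give one positive-sloped and one negative-sloped zero-crossing, each localized to the position of the intensity change.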


5.4 Matching

At the heart of the matching problem lies the problem of false
targets. If false targets arise in profusion, a somewhat sophisticated
algorithm must be used to eliminate them (cf. section 2).
Computational simplicity can be preserved only if false targets are
rare, and the existence of several independent spatial-frequency-tuned
channels provides a way of accomplishing this.

We propose that, for each set of masks of a given size, symbols of
the same type (zero-crossing or termination) and sign are matched
between the two images. If each channel were very narrowly tuned to a
wavelength λ, the minimum distance between zero-crossings of the same
sign in each image would be about λ. In this case, matching would be
unambiguous in a disparity range up to λ. The same argument holds
qualitatively for the actual channels, but because they are not so
narrowly tuned, the disparity range for unambiguous matching will be
smaller and must be estimated. This may be done in the following way.
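
Before turning to that estimate, the matching rule just proposed can be sketched in code. This is an illustration under simplifying assumptions: each image row is reduced to a list of (position, sign) pairs from one channel, a match is accepted only when exactly one right-image symbol of the same sign lies within the permitted disparity range, and the function name is hypothetical.

```python
def match_zero_crossings(left, right, max_disparity):
    """Match same-sign symbols between two rows within a disparity range.

    left, right: lists of (position, sign) with sign in {+1, -1}.
    Returns (matches, ambiguous), where matches holds (left_position, disparity)
    for unique candidates and ambiguous holds (left_position, [disparities]).
    """
    matches, ambiguous = [], []
    for pos, sign in left:
        candidates = [r_pos - pos for r_pos, r_sign in right
                      if r_sign == sign and abs(r_pos - pos) <= max_disparity]
        if len(candidates) == 1:
            matches.append((pos, candidates[0]))
        elif len(candidates) > 1:
            ambiguous.append((pos, candidates))  # false targets are present
    return matches, ambiguous


if __name__ == "__main__":
    # Same-sign zero-crossings are 10 units apart; the true disparity is 2.
    left = [(10, 1), (15, -1), (20, 1)]
    right = [(12, 1), (17, -1), (22, 1)]
    print(match_zero_crossings(left, right, max_disparity=4))
    print(match_zero_crossings(left, right, max_disparity=9))
```

With same-sign symbols spaced 10 units apart, a search range of 4 gives unique matches, whereas a range of 9 already admits a false target for one of the symbols.
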
The argument is carried out for zero-crossings, since terminations are
sparser and pose less of a false-target problem.


Table 2

Step 2 of the stereopsis computation: zero-crossings and terminations

(a) The outputs of each of the four filters (for each value of w),
evaluated at θ = 90° (vertical orientation), are scanned along the
horizontal direction (θ = 0), and the positions of positive- and
negative-sloped zero-crossings are found.

(b) Step (a) is also carried out at θ = 0, and the spatial
distribution of the zero-crossings and amplitudes of the associated
gradients are examined for the information they provide about the
positions of terminations.

(c) Formally, zero-crossings and terminations may be defined as
follows: Zero-crossing positions (x*, y*) are the non-trivial solutions
to

(1) F_{w,90} = 0 for zero-crossings from the vertical masks

(2) F_{w,0} = 0 for zero-crossings from the horizontal masks

The non-trivial solutions to (2) define a set of curves in x and y,
each given parametrically by (x(s), y(s)). Then termination points
(ξ, η) are the non-trivial solutions to

(3) d²(F_{w,0})/ds² = 0, i.e. taking the derivatives along each
of the curves (x(s), y(s)).


Statistical  analysis  of  zero-crossings 

The  quantity  of  interest  is  the  probability  distribution  of  the 
interval  between  adjacent  zero-crossings  of  the  same  sign  in  the 
filtered  image.  This  depends  on  (a)  the  image  characteristics,  and  (b) 
the  filter  characteristics.  For  (a),  we  assume  that  the  input  to  the 
masks  is  gaussian.  More  precisely,  if  I(x,  y)  is  the  mask  input  at 
coordinate  (x,  y),  and  h(y)  represents  the  longitudinal  weighting 
function  of  the  mask  (see  table  1),  our  assumption  is  that 
f(x) = ∫ I(x, y) h(y) dy is a gaussian process.

For  (b),  we  examine  two  extreme  cases.  Since  the  actual  filters 
have  a half-power  bandwidth  of  one  octave,  the  first  case  we  consider 
is  that  of  an  ideal  linear  band-pass  filter  of  width  one  octave,  as 
illustrated  in  figure  8a.  The  second  case  (figure  8b)  is  the  receptive 
field suggested by the threshold experiments of Wilson & Gieze (1977),
consisting of excitatory and inhibitory gaussian distributions, with
space-constants in the ratio 1:1.5 (see figures 5 and 7a). In both
cases, the filtered image is a gaussian zero-mean process. We also
take the worst case, i.e. that in which the power spectrum of the
channel's  input  is  essentially  white  in  the  relevant  spectral  range. 

Our problem is now reduced to that of finding the distribution of
the intervals between alternate zero-crossings by a stationary normal
process. Many authors have considered this problem, dating from the
pioneering work of Rice (1945) (see for example Longuet-Higgins 1962,
Leadbetter 1969).

8. Interval distribution for zero-crossings. A "white" gaussian
process is passed through a filter with the frequency characteristic
(transfer function) shown in (1). The interval distributions for the
first (P0) and second (P1) zero-crossings of the resulting zero-mean
gaussian process are approximated in (2). Given a zero-crossing at the
origin, the probability of having at least two within a distance ξ is
approximated by the integral of P1 and shown in (3). In (A), these
quantities are given for an ideal band-pass filter one octave wide and
with centre frequency ω = 2π/λ; (B) represents the case of the
receptive field described by Cowan (1977) and Wilson & Giese (1977).
The corresponding spatial distribution of excitation and inhibition,
i.e. the inverse fourier transform of (B1), appears, in the same units,
in table 1. For case A a probability level of P1 = 0.001 occurs
at ξ = 2.3, and a probability level of 0.5 occurs at ξ = 6.1. The
corresponding figures for case B are ξ = 1.5 and ξ = 5.4.

Situation B is derived from channel characteristics at threshold
(see figures 8a and 6), and represents the worst case likely from the
point of view of stereo matching. Situation A is closer to Cowan's
(1977) guess at the suprathreshold condition (figure 8b). In situation
B, the ratio of the space constants for excitation and inhibition
(table 1 and figure 6) is 1:1.5; the values of ξ change by not more
than 5% if this ratio is 1:1.75 (Wilson 1978b).

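Assume that there is a zero-crossing at the origin, and let P0(ξ),
P1(ξ) be the probability densities of the distances to the first and
second zero-crossings. P0 and P1 are approximated by the following
formulae (Rice 1945 section 3.4, Longuet-Higgins 1962 eqs. 1.2.1 &
1.2.3):

$$P_0(\xi) = \frac{1}{2\pi}\left[\frac{\psi(0)}{-\psi''(0)}\right]^{1/2} \frac{M_{23}(\xi)}{H(\xi)} \left(\psi^2(0) - \psi^2(\xi)\right)^{-3/2} \left[1 + H(\xi)\cot^{-1}\!\left(-H(\xi)\right)\right]$$

$$P_1(\xi) = \frac{1}{2\pi}\left[\frac{\psi(0)}{-\psi''(0)}\right]^{1/2} \frac{M_{23}(\xi)}{H(\xi)} \left(\psi^2(0) - \psi^2(\xi)\right)^{-3/2} \left[1 - H(\xi)\cot^{-1}\!\left(H(\xi)\right)\right]$$

where ψ(ξ) is the autocorrelation of the underlying stochastic process,
' denotes differentiation w.r.t. ξ, and

$$H(\xi) = M_{23}(\xi)\left[M_{22}^2(\xi) - M_{23}^2(\xi)\right]^{-1/2}$$

$$M_{22}(\xi) = -\psi''(0)\left(\psi^2(0) - \psi^2(\xi)\right) - \psi(0)\,\psi'^2(\xi)$$

$$M_{23}(\xi) = \psi''(\xi)\left(\psi^2(0) - \psi^2(\xi)\right) + \psi(\xi)\,\psi'^2(\xi)$$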

These approximations cease to be accurate for large values of ξ (i.e.
of order λ, where 2π/λ is the centre frequency of the channel; see
Longuet-Higgins (1962) for a discussion of various approximations),
where they overestimate P0 and P1. P1(ξ) is the quantity of interest
here, since it is the interval distribution between zero-crossings of
the same sign.
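As a rough illustration of how P0 and P1 can be evaluated, the sketch below codes Rice's approximations in the form stated in its comments and applies them to a flat spectrum sampled at discrete frequencies. Several things here are assumptions rather than part of the text: the band edges 2/3 and 4/3 (one octave around unit centre frequency), the discrete-frequency representation of the autocorrelation, the function names, and the trapezoidal integration that starts slightly above ξ = 0 to avoid the 0/0 limit there.

```python
import math


def flat_band(lo=2.0 / 3.0, hi=4.0 / 3.0, n=101):
    """Angular frequencies sampling a flat one-octave band (assumed edges)."""
    return [lo + (hi - lo) * k / (n - 1) for k in range(n)]


def autocorr(xi, freqs):
    """psi, psi' and psi'' at lag xi for equal power at every frequency."""
    n = len(freqs)
    psi = sum(math.cos(w * xi) for w in freqs) / n
    d1 = -sum(w * math.sin(w * xi) for w in freqs) / n
    d2 = -sum(w * w * math.cos(w * xi) for w in freqs) / n
    return psi, d1, d2


def rice_p0_p1(xi, freqs):
    """Rice's approximations to the first and second zero-crossing densities."""
    psi0, _, dd0 = autocorr(0.0, freqs)
    psi, d1, d2 = autocorr(xi, freqs)
    q = psi0 ** 2 - psi ** 2
    m22 = -dd0 * q - psi0 * d1 ** 2
    m23 = d2 * q + psi * d1 ** 2
    # M23 / H equals sqrt(M22^2 - M23^2); the floor guards against rounding.
    root = math.sqrt(max(m22 ** 2 - m23 ** 2, 1e-300))
    h = m23 / root
    common = math.sqrt(psi0 / -dd0) * root * q ** -1.5 / (2.0 * math.pi)

    def acot(x):
        return math.atan2(1.0, x)  # inverse cotangent with values in (0, pi)

    return common * (1.0 + h * acot(-h)), common * (1.0 - h * acot(h))


def cumulative_p1(xi_max, freqs, step=0.02, start=0.1):
    """Trapezoidal integral of P1 from `start` to xi_max."""
    total, x = 0.0, start
    prev = rice_p0_p1(x, freqs)[1]
    while x + step <= xi_max + 1e-12:
        x += step
        cur = rice_p0_p1(x, freqs)[1]
        total += 0.5 * (prev + cur) * step
        prev = cur
    return total
```

Calling `cumulative_p1(xi, flat_band())` for a range of ξ gives a curve that can be set beside the integral of P1 plotted in figure 8; agreement with the published values is not guaranteed, since the band definition used here is a guess.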

P0 and P1 were computed for the two filters of figures 8a1 and 8b1,
and they are plotted in figures 8a2 and 8b2. The integrals of the two P1
curves appear in figures 8a3 and 8b3. From these graphs, we see for
example that the 0.05 probability level for the presence of false
targets occurs at ξ = 4.1 (approximately λ/1.52) for the ideal
band-pass filter one octave wide, centred on wavelength λ (figure 8a1),
and at ξ = 3.1 for the receptive field of figure 8b1. In this case, ξ is
approximately λ/2, where λ is the principal wavelength associated with
the channel, and λ = 2.2w, where w is the measured width of the central
excitatory area of the receptive field. Thus in this case, the 95%
confidence limit occurs at a disparity approximately equal to w (ξ =
3.1, w = 2.8).

At the 0.001 probability level, the ideal band-pass filter is 50%
better (the corresponding ξ is larger) than the receptive field filter
with the same centre frequency; at the 0.05 probability level it is 30%
better; and at the 0.5 probability level, it is 13% better. The legend
to figure 8 provides more details about these results.
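The interval statistics can also be estimated empirically. The following sketch, a Monte Carlo illustration rather than the method used for figure 8, builds an approximately gaussian one-octave band-limited signal as a sum of sinusoids with random frequencies and phases, finds its upward zero-crossings, and measures the fraction of same-sign intervals that are no longer than a given distance. The band edges (2/3 and 4/3 of the centre frequency), the number of components, the sampling density and the function names are all assumptions.

```python
import math
import random


def same_sign_intervals(n_components=48, n_samples=20000,
                        samples_per_wavelength=40, seed=0):
    """Intervals between successive upward zero-crossings, in units of lambda."""
    rng = random.Random(seed)
    w_c = 2.0 * math.pi  # centre frequency for wavelength lambda = 1
    freqs = [rng.uniform(2.0 / 3.0 * w_c, 4.0 / 3.0 * w_c)
             for _ in range(n_components)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_components)]
    dx = 1.0 / samples_per_wavelength
    ups, prev = [], None
    for i in range(n_samples):
        x = i * dx
        v = sum(math.cos(w * x + p) for w, p in zip(freqs, phases))
        if prev is not None and prev < 0.0 <= v:
            # linear interpolation between the samples at x - dx and x
            ups.append(x - dx * v / (v - prev))
        prev = v
    return [b - a for a, b in zip(ups, ups[1:])]


def fraction_within(intervals, xi):
    """Fraction of same-sign intervals that are no longer than xi."""
    return sum(1 for d in intervals if d <= xi) / len(intervals)
```

For example, `fraction_within(same_sign_intervals(), 0.5)` estimates the chance of a same-sign false target within half a wavelength of a given zero-crossing for this idealized channel.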

We have made a similar comparison between the sustained and
transient channels of Wilson (1978a) and of Wilson & Bergen (1978). If
the sustained channels correspond to the case of figure 8b, the
transient channels have a larger ratio of the space constants for
inhibition and excitation, a somewhat larger excitatory space-constant,
and an excitatory area larger than the inhibitory. Even under these
conditions, the values change only slightly.


The  matching  process 


We  now  apply  the  results  of  these  calculations  to  the  matching 
process,  and  show  that  within  a given  channel  there  are  essentially  two 
possible  ways  of  dealing  with  false  targets.  If  one  wishes  to  avoid 
false  targets  altogether,  the  disparity  range  over  which  a match  is 
sought  must  be  restricted  to  ±w/2  (see  figure  9a).  For  suppose  zero- 
crossing L in  the  left  image  matches  zero-crossing  R in  the  right 
image.  The  above  calculations  assure  us  that  the  probability  of 
another  zero-crossing  of  the  same  sign  within  w of  R in  the  right  image 
is  less  than  0.05.  Hence  if  the  disparity  between  the  images  is  less 
than  w/2,  a search  for  matches  in  the  range  ±w/2  will  yield  only  the 
correct  match  R (with  probability  0.95).  Such  a low  error  rate  can  be 
accommodated without resorting to sophisticated algorithms. For
example,  two  reasonable  ways  to  increase  the  matching  reliability  are 
(a)  to  demand  rough  agreement  between  the  slopes  of  the  matched  zero- 
crossings,  and  (b)  to  fail  to  accept  an  isolated  match  all  of  whose 
neighbours  give  different  disparity  values.  Of  course  if  the  disparity 
between the images exceeds w/2, this procedure will fail, a
circumstance that we discuss later.
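The first strategy can be sketched in a few lines. In this hedged illustration each zero-crossing is a hypothetical `(position, sign)` pair on a single horizontal line, disparity is taken as right position minus left position, and a match is accepted only when exactly one same-sign candidate lies within ±w/2. These representational choices and the function name are assumptions, not part of the theory.

```python
from typing import Dict, List, Optional, Tuple

Crossing = Tuple[float, int]  # (position, sign), sign is +1 or -1


def match_algorithm_1(left: List[Crossing], right: List[Crossing],
                      w: float) -> Dict[int, Optional[float]]:
    """Map each left-image crossing index to its disparity, or None."""
    result: Dict[int, Optional[float]] = {}
    for i, (xl, sl) in enumerate(left):
        # same-sign candidates inside the restricted search range of +/- w/2
        found = [xr - xl for xr, sr in right
                 if sr == sl and abs(xr - xl) <= w / 2.0]
        result[i] = found[0] if len(found) == 1 else None
    return result
```

With `w = 4`, a left crossing at 10 and right crossings of the same sign at 11 and 14 yield a disparity of 1, because the crossing at 14 lies outside ±w/2. The additional checks mentioned above (slope agreement, rejection of isolated matches) are omitted here.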

9. The matching process. A zero-crossing L in the left image matches
one R displaced by disparity d in the right image. The probability of
a false target within w of R is small, so provided that d < w/2 (case
A), almost no false targets will arise in the disparity range ±w/2.
This gives the first possible algorithm. Alternatively (case B), all
matches within the range ±w may be considered. Here, false targets (F)
can arise in about 50% of the cases, but the correct solution is also
present. If the correct match is convergent, the false target will
with high probability be divergent. Therefore in the second algorithm,
unique matches are accepted as correct, and the remainder as ambiguous
and subject to the "pulling effect", illustrated in case C. Here, L1
could match R1 or R2, but L2 can match only R2. Because of this, and
because the two matches have the same disparity, L1 is assigned to R1.

There is, however, an alternative strategy that allows one to
deal with the matching problem over a larger disparity range. Let us
consider the possible situations if the disparity between the images is
d, where |d| < w (figure 9b). Observe firstly that if d > 0, the
correct match is almost certainly the only convergent candidate in the
range (0, w) (the probability of another is less than 0.05). Secondly,
the probability of a (divergent) false target is at most 0.5.
Therefore, 50% of all possible matches will be unambiguous and correct,
and the remainder will be ambiguous, mostly consisting of two
alternatives, one convergent and one divergent, one of which is always
the correct one.

In the ambiguous cases, selection of the correct alternative can be
based simply on the sign of neighbouring unambiguous matches. This
algorithm will fail for image disparities that significantly exceed ±w,
since the proportion of unambiguous matches will be too low (roughly
0.2 for ±1.5w).
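A minimal sketch of this second strategy follows. It assumes the same hypothetical `(position, sign)` representation of zero-crossings on one horizontal line, takes positive disparity to mean convergent, and lets the nearest unambiguous match decide the sign for an ambiguous one. The nearest-neighbour rule and all names are illustrative assumptions; the text only requires that neighbouring unambiguous matches determine the sign.

```python
from typing import List, Optional, Tuple

Crossing = Tuple[float, int]  # (position, sign), sign is +1 or -1


def match_algorithm_2(left: List[Crossing], right: List[Crossing],
                      w: float) -> List[Optional[float]]:
    """Return one disparity (or None) per left-image crossing."""
    # Pass 1: collect same-sign candidates within +/- w.
    cands = [[xr - xl for xr, sr in right if sr == sl and abs(xr - xl) <= w]
             for xl, sl in left]
    assigned: List[Optional[float]] = [c[0] if len(c) == 1 else None
                                       for c in cands]
    unique = [i for i, c in enumerate(cands) if len(c) == 1]

    # Pass 2: an ambiguous match is "pulled" to the candidate whose sign
    # agrees with the nearest unambiguous neighbour.
    for i, c in enumerate(cands):
        if len(c) < 2 or not unique:
            continue
        j = min(unique, key=lambda k: abs(left[k][0] - left[i][0]))
        pulled = [d for d in c if (d > 0) == (assigned[j] > 0)]
        if len(pulled) == 1:
            assigned[i] = pulled[0]
    return assigned
```

In the usage below, the first left crossing has one convergent and one divergent candidate; its unambiguous neighbour is convergent, so the convergent candidate is chosen.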

Sparse  images  like  an  isolated  line  or  bar,  that  yield  few  or  no 
false  targets,  pose  a different  problem.  They  often  give  rise  to 
unique  matches,  and  may  therefore  be  relied  upon  over  quite  a large 
disparity  range.  Hence  if  the  above  strategy  fails  to  disclose 
candidate  matches  in  its  disparity  range,  the  search  for  possible 
matches  may  proceed  outwards,  ceasing  as  soon  as  one  is  found. 

In summary then (see table 3) there are two immediate candidates
for matching algorithms. The simpler is restricted to a disparity
range of ±w/2, and in its most straightforward form will fail to assign
5% of the matches. The second involves some straightforward
comparisons between neighbouring matches, but even before these
comparisons, the 50% unambiguous matches could be used to drive
eye-movements, and provide a rough sensation of depth.

The  implementation  of  the  first  of  these  algorithms  is 
straightforward.  The  second  one  can  be  implemented  most  economically 
using  two  "pools",  one  sensitive  in  a graded  way  to  convergent  and  the 
other  to  divergent  disparities  (see  figure  10).  Notice  that,  in  this 
sense,  the  first  algorithm  requires  only  one  "pool",  that  is,  a single 
unit sensitive in a graded way to the disparity range ±w/2.

In  the  second  algorithm,  matches  that  are  unambiguous  or  already 
assigned  can  "pull"  neighbouring  ambiguous  matches  to  whichever 
alternative has the same sign. This may be related to the "pulling
effect" described in psychophysical experiments by Julesz & Chang
(1976). Notice however that this algorithm requires the existence of
pulling only across pools and not within pools (in the terminology of
Julesz & Chang p. 119).

Disparities  larger  than  w can  be  examined  in  very  sparse  images. 
If,  for  example,  both  primary  pools  (covering  a disparity  range  of  ±w) 
are  silent,  detectors  operating  outside  this  range,  possibly  with  a 
broad  tuning  curve,  may  be  consulted.  In  a biologically  plausible 
implementation,  these  detectors  should  be  inhibited  by  activity  in  the 
primary pools (see figure 10). It is tempting to suggest that
detectors for these outlying disparities (i.e. exceeding about ±w) may
give rise to depth sensations and eye-movement control in diplopic
conditions.

Table 3

Step 3 of the stereopsis computation: Matching (algorithm 2)

(a) For each zero-crossing or termination of a given sign in one
image, matches are sought in the other in the range ±w. If a unique
match is found, it is read by the memory. In no more than 50% of the
cases the match will be ambiguous, involving usually one convergent and
one divergent candidate, one of which is correct. The signs of
neighbouring matches that are unambiguous or already assigned
determine the choice in these cases. The disparity of the chosen pool
(either divergent or convergent) is read into the memory.

(b) It may happen that no matches can be found in this range. If
failure occurs for a significant percentage of the zero-crossings within
a small neighbourhood of the fixation point, then it is assumed that
disparities outside the range ±w occur there.

If the image is not sparse, and the disparity exceeds the
operating range, both algorithms will fail. Can the failure be
recognized simply at this low level?

For  the  first  algorithm,  no  correct  match  will  be  possible  in  the 
range  ±w/2.  The  probability  of  a random  match  in  this  range  is  about 
0.4, i.e. significantly less than 1.0. When the disparity between the
two  images  lies  in  the  range  ±w/2,  there  will  always  be  at  least  one 
match.  It  is  therefore  relatively  easy  to  discriminate  between  these 
two  situations. 

For  the  second  algorithm,  an  analogous  argument  applies:  in  this 
case  the  probability  of  no  candidate  match  is  about  0.3  for  image 
disparities  lying  outside  the  range  ±w,  and  zero  for  disparities  lying 
within  it.  Again,  it  is  relatively  easy  to  discriminate  between  the 
situations. 
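The discrimination argument can be made quantitative under one extra assumption that the text does not state, namely that neighbouring zero-crossings fail to find candidates independently. The sketch below uses the per-crossing figure of about 0.3 quoted for the second algorithm; the decision threshold of 0.15 and the function names are hypothetical.

```python
def prob_failure_undetected(n_crossings: int,
                            p_no_candidate: float = 0.3) -> float:
    """Chance that an out-of-range patch still shows a candidate everywhere.

    Assumes independence between crossings (an assumption, see lead-in).
    """
    return (1.0 - p_no_candidate) ** n_crossings


def looks_out_of_range(n_crossings: int, n_unmatched: int,
                       threshold: float = 0.15) -> bool:
    """Flag a neighbourhood when too many crossings have no candidate."""
    return n_crossings > 0 and n_unmatched / n_crossings > threshold
```

With ten crossings in a neighbourhood, the chance that an out-of-range disparity goes unnoticed is 0.7 raised to the tenth power, which is under 3%. For the first algorithm the corresponding per-crossing figure would be about 0.6, since a random match occurs with probability about 0.4.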


Implications  for  psychophysical  measurements  of  Panum's  fusional  area 

Using  the  second  of  the  above  algorithms,  matches  may  be  assigned 
correctly for a disparity range ±w. The precision of the disparity
values thus obtained should be quite high, and a roughly constant
proportion of w (which one can estimate from stereoacuity results at
about w/20). For foveal channels, this means ±3' disparity with
resolution 10" for the smallest, and ±12' (perhaps up to ±20' if Wilson
& Bergen (1978) holds for stereopsis) with resolution 40" for the
largest ones. At 4 degrees eccentricity, the range is ±5.3' to about
±34'. We assume that this range corresponds to stereoscopic fusion,
and  that  outside  it  one  enters  diplopic  conditions,  in  which  disparity 
can  be  estimated  only  for  relatively  sparse  images. 
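The arithmetic behind these figures is simple enough to state as a short sketch. It assumes, as the text suggests, a fusional range of ±w and a resolution of w/20; the function name and the unit conventions (w in minutes of arc, resolution in seconds of arc) are choices made here for illustration.

```python
from typing import Tuple


def disparity_range_and_resolution(
        w_arcmin: float,
        acuity_fraction: float = 1.0 / 20.0) -> Tuple[Tuple[float, float], float]:
    """Return ((min, max) fusional range in arcmin, resolution in arcsec)."""
    fusion_range = (-w_arcmin, w_arcmin)
    resolution_arcsec = w_arcmin * 60.0 * acuity_fraction
    return fusion_range, resolution_arcsec
```

For w = 3' this gives a resolution of 9", and for w = 12' it gives 36", which the text rounds to 10" and 40" respectively.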

Under  these  assumptions,  our  predicted  values  apparently 
correspond  quite  well  to  available  measures  of  the  fusional  limits 
without  eye  movements.  Mitchell  (1966)  used  small  flashed  line  targets 
and found, in keeping with earlier studies, that the maximum amount of
convergent or divergent disparity without diplopia is 10-14' in the
fovea, and about 30' at 5 degrees eccentricity. The extent of the
so-called Panum fusional area is therefore twice this.

Under stabilized image conditions, Fender & Julesz (1967) found
that fusion occurred between line targets (13' by 1 degree high) at a
maximum disparity of 40'. This value probably represents the whole
extent of Panum's fusional area. Using the same technique on a
random-dot stereogram, Fender & Julesz arrived at a figure of 14' (6'
displacement and 8' disparity within the stereogram). Since the dot
size was only 2', one may expect more energy in the high frequency
channels than in the low, which would tend to reduce the fusional area.
Julesz & Chang (1976), using a 6' dot size over a visual angle of 5
degrees, routinely achieved fusion up to ±18' disparity. Taking all
factors  into  account,  these  figures  seem  to  be  consistent  with  our 
expectations. 

The  matching  process  is  summarized  in  table  3. 


10. An implementation of the second matching algorithm. For each mask
size of central width w, there are two pools of disparity detectors,
signalling crossed or uncrossed disparities and spanning a range of ±w.
There may be an additional pool of detectors finely tuned to zero
disparity. Additional detectors for diplopic disparities probably exist
beyond this range. They are vetoed by detectors of smaller absolute
disparity.

5.5 Dynamic memory storage: the 2½-D sketch


According  to  our  theory,  once  matches  have  been  obtained  using 
masks  of  a given  size,  they  are  represented  in  a temporary  buffer. 
These  matches  also  control  vergence  movements  of  the  two  eyes,  thus 
allowing  information  from  large  masks  to  bring  small  masks  into  their 
range  of  correspondence.  We  postpone  a detailed  discussion  of  this 
process  until  the  next  section. 


Why  a memory? 

The  reasons  for  postulating  the  existence  of  a memory  are  of  two 
kinds,  those  arising  from  general  considerations  about  early  visual 
processing, and those concerning the specific problem of stereopsis.

As we have seen, a memory like the 2½-D sketch (see figure 5) is
computationally desirable because it makes explicit information about
the image in a form that is closely matched to what early visual
processes can deliver (Marr 1977 section 3.6 and Table 1). It is
possible and reasonable to synthesize the outputs of various early
processes in such a representation because the information they extract
from images has a well-defined physical interpretation, namely the
shape of the visible surfaces. The 2½-D sketch describes these
processes' results by representing the relative depth and
surface-orientation associated with each viewing direction, together
with contours of discontinuity in depth or surface-orientation.

The  more  particular  reason  associated  specifically  with  stereopsis 
is  the  computational  simplicity  of  the  matching  process,  which  requires 
a buffer  in  which  to  preserve  its  results  as  (1)  disjunctive  eye 
movements  change  the  plane  of  fixation,  and  (2)  objects  move  in  the 
visual field. In this way, the 2½-D sketch becomes the place where
"global"  stereopsis  is  actually  achieved,  combining  the  matches 
provided  independently  by  the  different  channels  and  making  the 
resulting  disparity  map  available  to  other  visual  processes. 


The  nature  of  the  memory 

The 2½-D sketch is a dynamic memory with considerable intrinsic
computing power. It is perhaps worth stressing that it belongs to
early visual processing, and cannot be influenced directly from higher
levels, for example via verbal instructions, a priori knowledge or even
previous visual experience (Ono & Nakamizo 1977, Frisby & Clatworthy
1975, and remarks of Marr 1977 section 3 about Ittelson 1960).

Although  we  have  little  direct  evidence  about  the  memory,  one 
would  expect  a number  of  constraints  derived  from  the  physical  world  to 
be  embedded  in  its  internal  structure.  For  example,  the  rule  R2  of 
section  1,  that  disparity  changes  smoothly  almost  everywhere,  might  be 
implemented in the 2½-D sketch by connexions similar to those that
implement it in Marr & Poggio's (1976) cooperative algorithm (figure
2c). This active rule in the memory may be responsible for the
sensation of a continuous surface to which even a sparse stereogram can
give rise (Julesz 1971 figure 4.5-5, Grimson, Marr & Nishihara 1978).

We  would  expect  other  constraints  to  be  embedded  there  in  a 
similar  way,  for  example  the  continuity  of  discontinuities  in  the 
visible  surfaces,  which  we  believe  underlies  the  phenomenon  of 
subjective  contours  (Marr  1977  section  3.6).  It  is  possible  that  even 
more  complicated  consistency  relations,  concerning  the  possible 
arrangements  of  surfaces  in  three-dimensional  space,  are  realized  by 
computations in the memory (e.g. constraints in the spirit of those
made  explicit  by  Waltz  1975).  Such  constraints  may  eventually  form  the 
basis  for  an  understanding  of  phenomena  like  the  Necker-cube  reversal. 

From  this  point  of  view,  it  is  natural  that  many  illusions 
concerning  the  interpretation  of  three-dimensional  structure  (the 
Necker cube, subjective contours, the Müller-Lyer figure, the
Poggendorff  figure,  etc.)  should  take  place  after  stereoscopic  fusion 
(see  Julesz  1971,  Blomfield  1973). 


The information represented in the 2½-D sketch

The 2½-D sketch represents the surface orientation, relative
depth, and contours of discontinuities of these quantities in a scene.
The exact form of the representation remains an open question, both
from the computational and biological points of view (Marr 1977
section 3). Because of the variety of purposes for which it has to be
used,  one  would  expect  it  to  be  easy  to  obtain  any  of  the  zeroth,  first 
or  second  derivatives  of  depth  in  any  direction  in  the  visual  field. 

The  representational  question  here  is,  are  all  these  quantities  stored 
directly,  or  are  some  obtained  from  the  others  on  demand? 

For  the  purposes  of  this  article,  however,  we  can  imagine  that  the 
memory records the matches obtained by binocular fusion in the form
(x_L, y_L; x_R; d), that is, the matching coordinates on the left and
right images together with specification of the corresponding depth d
relative to some reference point in the scene.
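One way to picture such a record is as a small immutable structure held in a buffer keyed by left-image position. The sketch below is only a programming analogy: the class names `MatchRecord` and `SketchBuffer`, the field names, the assumption that both images share the same vertical coordinate, and the dictionary-based storage are all hypothetical and say nothing about how the memory is actually organized.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class MatchRecord:
    x_left: float   # matching coordinate in the left image
    y_left: float   # vertical coordinate, assumed common to both images
    x_right: float  # matching coordinate in the right image
    depth: float    # depth relative to some reference point in the scene

    @property
    def disparity(self) -> float:
        return self.x_right - self.x_left


class SketchBuffer:
    """A toy temporary buffer holding one record per left-image position."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[float, float], MatchRecord] = {}

    def write(self, record: MatchRecord) -> None:
        self._records[(record.x_left, record.y_left)] = record

    def read(self, x_left: float, y_left: float) -> Optional[MatchRecord]:
        return self._records.get((x_left, y_left))
```

A later write at the same position simply replaces the earlier record, which loosely mirrors the idea of a buffer that is updated as new matches arrive.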


Dynamic  management  of  stored  information 

According  to  this  theory,  the  memory  preserves  depth  (or 
disparity)  information  during  the  scanning  of  a scene  with  disjunctive 
eye-movements,  and  during  movement  of  viewed  objects.  Information 
management  will  have  limitations  both  in  depth  and  in  time,  and  the 
main questions here are over what range of disparities can the 2½-D
sketch maintain a record of a match in the presence of incoming
information, and how long can it do this in its absence. The temporal
question is less interesting because the purpose of the buffer is to
organize incoming perceptual information, not to preserve it when there
is none. In fact, Fender & Julesz (1967) occluded one image of a
random-dot stereogram, and found that fusion was destroyed for
occlusion times longer than about 200 msec (see their figure 10).


The spatial aspects of the 2½-D sketch raise a number of
interesting  questions.  Firstly,  are  the  maximal  disparities  that  are 
preserved  by  the  memory  in  stabilized  image  conditions  the  same  as  the 
maximum  range  of  disparities  that  are  simultaneously  visible  in  a 
random-dot  stereogram  under  normal  viewing  conditions?  Secondly,  does 
the  distribution  of  the  disparities  that  are  present  in  a scene  affect 
the  range  that  the  memory  can  store?  For  example,  is  the  range  greater 
for  a stereogram  of  a spiral,  in  which  disparity  changes  smoothly,  than 
in  a simple  square-and-surround  stereogram  of  similar  overall 
disparity? 

For  the  first  question,  the  available  evidence  seems  to  indicate 
that the range is the same in the two cases. According to Fender &
Julesz (1967), the range is about 2 degrees for a random-dot
stereogram. When the complex stereograms given by Julesz (1971, e.g.
4.5-3) are viewed from about 20 cm, they give rise to disparities of
about the same order. If this were true, it would imply that the
maximal range of simultaneously perceivable disparities is a property
of the 2½-D sketch alone, and is independent of eye movements.

Fender & Julesz (1967) reported that under their experimental
conditions, the maximum range for line stimuli (13' wide) was less
(about 70') than the two degree range for random-dot stereograms. It
may well be that "cooperative" effects in the 2½-D sketch, that arise
from the implementation of rule R2 ("filling-in" phenomena),
increase the maximum storable disparity range for textured surfaces.




Foley, Applebaum & Richards (1975), however, obtained a figure of 2
degrees for 18' wide line stimuli flashed for 40 msec. This
discrepancy  may  be  due  in  part  to  contrast  and  luminance  effects. 

In  addition  to  these  restrictions  on  the  overall  disparity  range 
of the 2½-D sketch, there may also be limitations on the degree of
steepness  in  depth  of  a surface  that  can  be  represented.  At  some 
point,  too  steep  a gradient  in  a surface  will  be  represented  instead  as 
a discontinuity  in  depth.  The  data  of  Tyler  (1974)  are  quite 
suggestive  in  this  respect  (see  his  figure  2),  and  may  help  to 
characterize the filling-in properties of the memory (rule R2 of
section 1). One may also expect the critical steepness, at which
subjective contours may arise, to depend in part on the eccentricity of
the  surface  in  the  visual  field. 

With  regard  to  the  second  question,  it  seems  at  present  unlikely 
that  the  maximum  range  of  simultaneously  perceivable  disparities  is 
much  affected  by  their  distribution.  It  can  be  shown  that  the  figure 
of  about  2 degrees,  which  holds  for  stabilized  image  conditions  and  for 
freely  viewed  stereograms  with  continuously  varying  disparities,  also 
applies  to  stereograms  with  a single  disparity. 

Perception times do however depend on the distribution of
disparities in a scene (Frisby & Clatworthy 1975, Saye & Frisby 1975).
A stereogram of a spiral staircase ascending towards the viewer did not
produce the long perception times associated with a two-planar
stereogram of similar disparity range. This is to be expected within
the framework of our theory, because of the way in which we propose
vergence movements are controlled. We now turn to this topic.


5.6 Vergence Movements

Disjunctive  eye  movements,  which  change  the  plane  of  fixation  of 
the two eyes, are independent of conjunctive eye-movements (Rashbass &
Westheimer 1961b), are smooth rather than saccadic, have a reaction
time of about 160 msec, and follow a rather simple control strategy.
The (asymptotic) velocity of eye vergence depends linearly on the
amplitude of the disparity, the constant of proportionality being about
8 degrees/sec per degree of disparity (Rashbass & Westheimer 1961a).
Vergence movements are accurate to within about 2' (Riggs & Niehl
1960), and voluntary binocular saccades preserve vergence nearly
exactly (Williams & Fender 1977).
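These figures suggest a simple back-of-envelope model. If one assumes, as an idealization not stated in the text, that the loop is closed so that vergence velocity is always proportional to the remaining disparity, the disparity decays exponentially with rate 8 per second. The sketch below works out how long such a loop would take to bring a given disparity within the quoted 2' accuracy; the constant and function names are illustrative, and the reaction time is simply added on.

```python
import math

GAIN_PER_SEC = 8.0         # degrees/sec of vergence per degree of disparity
REACTION_TIME_SEC = 0.160  # latency before the movement starts


def residual_disparity(d0: float, t: float, gain: float = GAIN_PER_SEC) -> float:
    """Disparity remaining after t seconds of closed-loop vergence.

    Solves dd/dt = -gain * d, an assumed idealization of the control law.
    """
    return d0 * math.exp(-gain * t)


def time_to_reach(d0: float, tolerance: float, gain: float = GAIN_PER_SEC) -> float:
    """Movement time needed to bring |d0| down to the tolerance (same units)."""
    if abs(d0) <= tolerance:
        return 0.0
    return math.log(abs(d0) / tolerance) / gain
```

For an initial disparity of 1 degree (60') and a tolerance of 2', `time_to_reach(60.0, 2.0)` is ln(30)/8, roughly 0.43 sec, or about 0.59 sec once `REACTION_TIME_SEC` is included.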

These  data  strongly  suggest  that  the  control  of  vergence  movements 
is continuous rather than ballistic. Furthermore, Westheimer &
Mitchell  (1969)  found  that  tachistoscopic  presentation  of  disparate 
images  led  to  the  initiation  of  an  appropriate  vergence  movement,  but 
not  to  its  completion. 

Thus  our  hypothesis,  that  vergence  movements  are  accurately 
controlled  through  matches  obtained  by  the  various  channels,  is 
consistent  with  the  observed  strategy  and  precision  of  vergence 
control. The hypothesis also accounts for the findings of Saye &
Frisby (1975). Scenes like the spiral staircase, in which disparity
changes smoothly, allow vergence movements to scan a large disparity
range under the continuous control of the outputs of even the smallest
masks. On the other hand, two-planar stereograms with the same
disparity range require a large vergence shift, but provide no accurate
information for its continuous control. The long perception times for
such stereograms may therefore be explained in terms of a random-walk
search strategy by the vergence control system. Furthermore, Saye &
Frisby (1975) concluded from other evidence that "merely knowing where
to direct eye movements is not sufficient to shorten stereopsis
perception times, whereas monocularly conspicuous features may be
sufficient." In other words, "on-line" guidance of vergence movements
seems to be required. The process is a simple continuous closed-loop
system which is usually inaccessible from higher levels.

By analogy with the fixation system of the fly (Reichardt & Poggio
1976), it may be possible to describe quantitatively the control of
vergence by disparity (Richards 1975 figure 9). Richards' suggestion,
that the relation between initial vergence velocity and disparity
should be proportional to the relation between perceived depth and
disparity, is attractive but far from being proved (compare Richards
1977 figure 1 and Rashbass & Westheimer 1961a figure 22). If it were
true, however, it would be consistent with the direct control of
vergence movements by the 2½-D sketch. This would imply that the
"diplopic disparity detectors" that we mentioned earlier (see figure 10)
achieve their control of vergence movements not directly, but via the
signals they provoke in the 2½-D sketch, and which correspond to a
real sensation of depth.

There  may  however  exist  some  simple  learning  ability  in  the 
vergence  control  system.  There  is  some  evidence  that  an  observer  can 
learn to make an efficient series of vergence movements (Frisby &
Clatworthy 1975). This learning effect seems however to be confined to
the type of information used by the closed-loop vergence control
system. A priori, verbal or high-level cues about the stereogram are
ineffective. 


5.7  Open  Questions 

There are several questions about the 2½-D sketch that relate
specifically  to  stereo.  They  will  have  to  be  examined  through  further 
psychophysical  and  computational  studies.  Some  of  the  most  immediate 
questions  are: 

(1)  How  many  matches  over  what  area  are  sufficient  to  cause 
information  to  be  written  into  the  memory? 

(2)  What  is  the  relationship  between  the  spatial  structure  of  the 
information  written  in  the  memory  and  the  scanning  strategy  of 
disjunctive  and  conjunctive  eye  movements? 

(3)  What  are  the  rules  that  govern  when  filling-in  takes  place,  and 
what  is  the  three-dimensional  shape  of  the  filled-in  surface? 

(4) Is information moved around in the 2½-D sketch during
disjunctive  or  conjunctive  eye  movements,  and  if  so,  how?  For  example,
does  the  current  fixation  point  always  correspond  to  the  same  point  in 
the  2½-D  sketch?  If  so,  this  implies  that  information  is  being  moved
almost  continuously,  both  laterally  and  in  depth.  If  not,  there  must 
be  distinguished  moments  at  which  information  is  moved  or  the  memory  is 
cleared.  To  some  extent  one  can  of  course  simulate  the  movement  of 
information  in  the  memory  by  modifying  the  way  it  is  addressed,  but 
beyond  a certain  point  this  implies  unacceptably  inefficient  use  of  the 
memory’s  representation  capacity. 

(5)  What  precisely  is  stored  in  the  2½-D  sketch,  a  function  of  depth
or  of  its  derivatives?  As  we  have  seen,  the  available  evidence
suggests  that  the  overall  range  of  depths  that  can  be  represented  in
the  memory  corresponds  to  about  2  degrees  of  disparity.  If  this  is
right,  it  implies  that  the  2½-D  sketch  represents  some  function  of
depth  explicitly,  rather  than  implicitly  through  one  or  more  of  its
derivatives.


5.7  Summary


The  basic  structure  of  the  theory  is  summarized  in  table  4. 


TABLE  4.  FLOW  DIAGRAM  OF  THE  STEREOPSIS 
COMPUTATION. 

6  Experiments

In  this  section,  we  summarize  the  experiments  that  are  important
for  the  theory.  We  separate  psychophysical  experiments  from
neurophysiological  ones,  and  divide  the  experiments  themselves  into
four  categories:  those  whose  results  are  critical  and  already
available  (A);  those  whose  results  are  critical  but  not  yet  available,
and  which  therefore  amount  to  predictions  (P);  those  whose  results  are
neither  critical  nor  available,  but  are  of  interest  (I);  and  available
results  left  unexplained  by  the  theory  (W).  In  the  case  of  experimental
predictions,  we  make  explicit  their  importance  to  the  theory  by  a
system  of  stars:  three  stars  indicate  a  prediction  which,  if  falsified,
would  disprove  the  theory;  one  star  indicates  a  prediction  whose
falsification  remnants  of  the  theory  could  survive.


6.1  Computation 

The  theory  is  capable  of  solving  the  matching  problem  for  stereo
vision  of  natural  images  (Grimson  &  Marr  1978).  Solution  of  the
overall  stereo  vision  problem  will  require  more  detailed  information
about  the  2½-D  sketch.

2(A)  The  nature  and  characteristics  of  the  monocular  channels  are  as
described  by  Wilson  &  Giese  (1977)  and  Cowan  (1977).  (See  also  Wilson
1978a,  b,  Wilson  &  Phillips  1978,  and  Wilson  &  Bergen  1978).

3(P)**  The  channels  referred  to  in  (1)  and  (2)  are  the  same.  Evidence
consistent  with  this  is  provided  by  Felton,  Richards  &  Smith  (1972),
who  concluded  that  disparity  mechanisms  make  bar-by-bar  correlations,
as  opposed  to  edge-by-edge  ones.

4(P)***  Terminations  and  signed  zero-crossings  in  the  filtered  image
are  used  as  the  input  to  the  matching  process.

5(P)**  In  the  absence  of  eye  movements,  discrimination  between  two 
disparities  in  a random-dot  stereogram  is  only  possible  within  the 
range  ±w,  where  w is  the  width  associated  with  the  largest  active 
channel.  Using  filtered  stereograms,  Frisby  &  Mayhew  (1978)  found  that
discrimination  was  possible  without  eye  movements  in  the  range  0-8'.  If
the  fixation  point  was  actually  at  the  zero  disparity  position,  this
range  is  significantly  higher  than  our  theory  would  predict.  Their
experimental  procedure  does  not,  however,  eliminate  the  possibility
that  the  fixation  point  lay  in  the  middle  of  the  disparity  range.

6(P)***  In  the  absence  of  eye  movements,  the  magnitude  of  perceived 
depth  in  non-diplopic  conditions  is  limited  by  the  lowest  spatial 
frequency  channel  stimulated. 

7(P)***  In  the  absence  of  eye  movements,  the  minimum  fusable  disparity 
range  (Panum's  fusional  area)  is  ±3.1'  in  the  fovea,  and  ±5.3'  at  4
degrees  eccentricity.  This  requires  that  only  the  smallest  channels  be 
active. 

8(P)***  In  the  absence  of  eye  movements,  the  maximum  fusable  disparity
range  is  ±12'  (possibly  up  to  ±20')  in  the  fovea,  and  about  ±34'  at  4
degrees  eccentricity.  This  requires  that  the  largest  channels  be 
active,  for  example  by  using  bars  or  other  large  bandwidth  stimuli. 

9(P)**  In  the  absence  of  eye  movements,  the  perception  of  rivalrous
random-dot  stereograms  is  subject  to  certain  limitations.  For  example,
for  images  of  sufficiently  high  quality,  figure  2b  of  Mayhew  &  Frisby
(1976)  should  give  rise  to  depth  sensations,  but  figure  2c  should  not.
In  the  presence  of  eye  movements,  figure  2c  gives  a  sensation  of  depth.
This  could  be  explained  if  vergence  eye  movements  can  be  driven  by  the
relative  imbalance  between  the  numbers  of  unambiguous  matches  in  the
crossed  and  uncrossed  pools  over  a  small  neighbourhood  of  the  fixation
point.

10(A)  As  measured  by  disparity-specific  adaptation  effects,  the 
optimum  stimulus  for  a small  disparity  is  a high  spatial  frequency 
grating,  whereas  for  large  disparities,  the  most  effective  stimulus  is 
a low  spatial  frequency  grating.  Furthermore,  the  adaptation  effect 
specific  to  disparity  is  greatest  for  gratings  whose  periods  are  twice 
the  disparity  (Felton,  Richards  &  Smith  1972).  (In  our  terms,  in  fact,
2w  is  approximately  2.2A,  where  A  is  the  centre  frequency  of  the
channel).

11(A)  Evidence  for  the  two  pools  hypothesis  (Richards  1970,  1971, 
Richards  &  Regan  1973)  is  consistent  with  the  minimal  requirement  for
the  second  of  the  matching  algorithms  we  described  (section  5.3). 

12(W)  Stereoscopically  viewed  grating  pairs  of  identical  frequency  but
different  contrast  are  reported  to  produce  a  sensation  of  tilt
(Fiorentini  &  Maffei  1971).

13(P)***  In  the  absence  of  eye  movements,  the  perception  of  tilt  in
stereoscopically  viewed  grating  pairs  of  different  spatial  frequencies
is  limited  by  (6,  7  &  8)  above.

14(I)  In  the  presence  of  eye  movements,  is  the  range  of  simultaneously
fused  disparities  independent  of  their  distribution  (see  section  5.5
on  the  dynamic  management  of  stored  information)?  For  example,  is  the
fusable  range  the  same  for  a  spiral  and  for  a  single  square  in  depth?

15(I)  Tyler  (1974)  found  a  limit  in  the  rate  of  change  of  perceived
disparity  across  the  retina.  Does  this  limit  vary  with  eccentricity? 

16(I)  What  are  the  critical  parameters  (density,  distribution,
correlation,  etc.)  for  the  perception  in  depth  of  a  "solid"  surface  in
a  random-dot  stereogram?  (See  Julesz  1971  pp  150,  79,  figure  8.1-2,
and  White  1962).  What  are  the  rules  for  filling  in  a  surface  in  depth?

17(A)  Individuals  impaired  in  one  of  the  two  disparity  pools  show 
corresponding  reductions  in  depth  sensations  accompanied  by  a loss  of 
vergence  movements  in  the  corresponding  direction  (Jones  1972).

18(P)*  Outside  Panum's  area,  the  dependence  of  depth  sensation  on
disparity  should  be  roughly  proportional  to  the  initial  vergence 
velocity  under  the  same  conditions. 

19(A)  Perception  times  for  novel  two-planar  stereograms  are  much
longer  than  perception  times  for  stereograms  with  smoothly  varying
disparities  of  the  same  (large)  overall  range.

20(P)***  In  the  two-planar  case  of  (19),  vergence  movements  should
exhibit  a  random-search-like  structure.  The  three  star  status  holds
when  the  disparity  range  exceeds  the  size  of  the  largest  masks
activated  by  the  pattern.

21(P)***  The  range  of  vergence  movements  made  during  the  successful  and
precise  interpretation  of  complex,  high-frequency,  multi-layer,  random- 
dot  stereograms  should  span  the  range  of  disparities. 

22(I)  What  is  the  relationship  between  scanning  strategy  and  the
three-dimensional  spatial  structure  of  a stereo  image  pair? 

23(P)*  Perception  times  for  a  random-dot  stereogram  portraying  two
small  planar  targets  separated  laterally  and  in  depth,  against  an
uncorrelated  background,  should  be  longer  than  the  two-planar  case
(20).  Once  found,  their  representation  in  the  memory  should  be  labile 
if  an  important  aspect  of  the  representation  there  consists  of  local 
disparity  differences. 
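
Prediction (9) above suggests that vergence could be driven by the imbalance between unambiguous matches in the crossed and uncrossed pools. The following is a minimal sketch of that idea, not the authors' implementation. It assumes one-dimensional zero-crossing positions, a single channel of width w, and a sign convention (positive disparity taken as uncrossed) that is chosen here purely for illustration; the function names are hypothetical.

```python
from typing import List, Tuple

# A zero-crossing is (position, sign), with sign +1 or -1.
ZeroCrossing = Tuple[float, int]


def unambiguous_disparities(left: List[ZeroCrossing],
                            right: List[ZeroCrossing],
                            w: float) -> List[float]:
    """Disparities of left-image zero-crossings that have exactly one
    same-sign candidate in the right image within the range +/- w."""
    found = []
    for x_left, sign_left in left:
        candidates = [x_right - x_left
                      for x_right, sign_right in right
                      if sign_right == sign_left and abs(x_right - x_left) <= w]
        if len(candidates) == 1:  # unmatched or ambiguous items are skipped
            found.append(candidates[0])
    return found


def pool_imbalance(disparities: List[float]) -> int:
    """Number of uncrossed matches minus number of crossed matches.

    Assumed convention (illustrative only): positive disparity counts as
    uncrossed, negative as crossed; zero disparity joins neither pool.
    """
    uncrossed = sum(1 for d in disparities if d > 0)
    crossed = sum(1 for d in disparities if d < 0)
    return uncrossed - crossed


# Toy example: the right image is the left image shifted by 3 units.
left = [(0.0, +1), (10.0, -1), (20.0, +1)]
right = [(3.0, +1), (13.0, -1), (23.0, +1)]
matches = unambiguous_disparities(left, right, w=5.0)
drive = pool_imbalance(matches)
```

In this toy case every match falls in the same pool, so the imbalance is maximal and its sign would indicate the direction of the vergence movement under the assumed convention.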


6.3  Neurophysiology 

24(partly  A)  At  each  point  in  the  visual  field,  the  scatter  of  bar
mask  receptive  field  sizes  is  about  4:1  (Hubel  &  Wiesel  1974  figs.  1  &
4;  Wilson  &  Giese  1977  p.  27).  More  data  are  however  needed  on  this
point.  This  range  is  spanned  by  four  populations  of  receptive  field
size.

25(P)*  For  each  receptive  field  size,  the  local  density  of  receptive
field  centers  should  be  (at  least)  18  per  receptive  field  area. 

26(P)**  For  a  given  intensity  and  contrast,  these  cells  perform  a
nearly  linear  convolution  of  the  image  with  a bar-shaped  receptive 
field  of  medium  bandwidth  in  the  spatial  frequency  domain.  The 
representation  of  positive  and  negative  values  probably  involves 
different  cells. 

27(P)*  A  different  class  of  cells  may  represent  the  peak  and
termination  positions  and  signs  in  the  monocularly  filtered  images. 

28(P)**  There  exist  binocularly  driven  cells  sensitive  to
disparity.  A  given  cell  signals  a  match  between  either  a  zero-crossing
pair  or  a termination  pair,  both  items  in  its  pair  having  the  same 
sign. 

29(P)**  For  each  sign  (±)  and  type  (zero-crossing  or  termination)  of
match  at  each  point  in  the  visual  field,  there  should  exist  four
populations  of  matching  cells  (28),  fed  independently  by  the  four
populations  in  (24).

30(P)**  Each  of  the  sixteen  populations  defined  by  (29)  is  divided
into  at  least  two  (and  possibly  three)  main  disparity  pools,  tuned  to 
crossed  and  uncrossed  disparities  respectively,  with  sensitivity  curves 
extending  outwards  to  a disparity  of  about  the  width  of  its 
corresponding  receptive  field  centre  (see  figure  9).  Being  sensitive 
to  pure  disparity,  these  cells  are  sensitive  to  changes  in  disparity 
induced  by  vergence  movements.  In  addition,  there  may  be  one  pool 
quite  sharply  tuned  to  zero  disparity. 

31(P)*  In  addition  to  the  two  (or  three)  basic  disparity  pools  of
(30),  there  may  exist  cells  tuned  to  more  outlying  (diplopic) 
disparities  (compare  figure  9).  These  cells  should  be  inhibited  by  any 
activity  in  the  basic  pools. 

32(P)**  There  exists  a  neural  representation  of  the  2½-D  sketch.
This  includes  cells  that  are  highly  specific  for  some  monotonic
function  of  depth  and  disparity,  and  which  span  a  depth  range
corresponding  to  about  2  degrees  of  disparity.  Within  a  certain  range,
these  cells  may  not  be  sensitive  to  disjunctive  eye  movements.  This
corresponds  to  the  notion  that  the  plane  of  fixation  can  be  moved
around  within  the  2  degree  disparity  range  currently  being  represented
in  the  2½-D  sketch.

33(P)*  The  diplopic  disparity  cells  of  (31)  are  especially  concerned
with  the  control  of  disjunctive  eye  movements. 
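
The bookkeeping behind predictions (29) and (30) can be checked with a short enumeration. This is only a counting sketch; the channel labels 1 to 4 are hypothetical stand-ins for the four receptive field sizes of (24).

```python
from itertools import product

signs = ("+", "-")
match_types = ("zero-crossing", "termination")
channels = (1, 2, 3, 4)  # hypothetical labels for the four mask sizes of (24)

# One population of matching cells per (sign, type, channel) combination.
populations = list(product(signs, match_types, channels))

pools = ("crossed", "uncrossed")      # the two main pools of (30)
pools_with_zero = pools + ("zero",)   # optional pool tuned to zero disparity

n_populations = len(populations)                       # 16
n_two_pools = n_populations * len(pools)               # 32
n_three_pools = n_populations * len(pools_with_zero)   # 48
```

The count of sixteen in (30) is thus 2 signs x 2 match types x 4 channels, giving 32 or 48 pool-specific populations at each point in the visual field depending on whether a separate zero-disparity pool exists.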

6.4  Cautionary  remarks

Because  of  the  computational  nature  of  this  approach,  we  have  been 
able  to  be  quite  precise  about  the  nature  of  the  processes  that  are 
involved  in  this  theory.  Since  a process  may  in  general  be  implemented 
in  several  different  ways,  our  physiological  predictions  are  more
speculative  than  our  psychophysical  ones.  They  should  perhaps  be 
regarded  more  as  guidelines  for  investigations  rather  than  as  necessary 
consequences  of  the  theory. 

A number  of  other  general  remarks  are  in  order  here. 

(1)  The  first  concerns  our  hypothesis  of  the  near  linearity  of  the 
filtering  operation  for  a given  intensity  and  contrast.  This 
hypothesis  may  not  be  strictly  correct,  but  small  deviations  from  it 
should  not  greatly  affect  our  theory. 

(2)  The  second  remark  concerns  our  quantitative  estimates  of  the 
channel  characteristics.  In  suprathreshold  conditions,  the  receptive
field  may  change  slightly.  Inhibition  may  be  more  prominent,
corresponding  to  a  narrowing  of  the  channel's  bandwidth  in  the  spatial
frequency  domain  (cf.  figures  8a  &  b).  It  is  worth  noting  that  such  a
change  can  be  implemented  easily  in  a  system  that  separates  the
positive  ("on-centre")  from  the  negative  ("off-centre")  parts  of  the
signal  (De  Valois  197 ,  Burton,  Nagshineh  &  Ruddock  1978).  The
existence  of  a  natural  rectification  of  the  convolution  allows  a
narrowing  of  the  channel  without  adding  excitatory  side-lobes  to  the
receptive  field  of  the  cell  simply  by  increasing  the  strength  of  its 
inhibitory  surround.  Narrowing  the  channels  in  this  way  could  move  the 
filter  characteristics  in  the  direction  of  the  ideal  band-pass  filter 
of  figure  8a  (cf.  figure  8b),  thus  increasing  estimates  of  Panum's 
fusional  area  by  up  to  30%. 
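
The two remarks above concern a nearly linear filtering stage whose signed output is carried by separate "on-centre" and "off-centre" signals. The sketch below illustrates that arrangement in one dimension. It is an illustration under stated assumptions, not the filter of the theory: the difference-of-Gaussians profile, the surround-to-centre ratio of 1.6 and the sampling are all choices made here for the example, and every function name is hypothetical.

```python
import math
from typing import List, Tuple


def dog_mask(sigma_centre: float, ratio: float, half_width: int) -> List[float]:
    """Difference-of-Gaussians profile sampled at integer positions.

    The DOG shape and `ratio` (surround sigma / centre sigma) are
    illustrative assumptions, not values taken from the text.
    """
    sigma_surround = ratio * sigma_centre
    mask = []
    for x in range(-half_width, half_width + 1):
        centre = math.exp(-x * x / (2 * sigma_centre ** 2)) / sigma_centre
        surround = math.exp(-x * x / (2 * sigma_surround ** 2)) / sigma_surround
        mask.append(centre - surround)
    return mask


def convolve(signal: List[float], mask: List[float]) -> List[float]:
    """Linear convolution, keeping only positions where the mask fits."""
    n = len(mask)
    return [sum(signal[i + k] * mask[n - 1 - k] for k in range(n))
            for i in range(len(signal) - n + 1)]


def rectify(values: List[float]) -> Tuple[List[float], List[float]]:
    """Split a signed signal into 'on' (positive) and 'off' (negative) parts."""
    on = [max(v, 0.0) for v in values]
    off = [max(-v, 0.0) for v in values]
    return on, off


def signed_zero_crossings(values: List[float]) -> List[Tuple[int, int]]:
    """(index, sign) pairs where the signal strictly changes sign between
    neighbouring samples: +1 for negative-to-positive, -1 for the reverse.
    Samples that are exactly zero are ignored in this sketch."""
    out = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a < 0 < b:
            out.append((i, +1))
        elif a > 0 > b:
            out.append((i, -1))
    return out


# A single luminance step filtered by one channel.
step = [0.0] * 20 + [1.0] * 20
response = convolve(step, dog_mask(sigma_centre=2.0, ratio=1.6, half_width=12))
on, off = rectify(response)
crossings = signed_zero_crossings(response)
```

Because the two rectified signals never overlap, the signed response is recovered as on minus off, and the step produces a single signed zero-crossing at the position of the edge.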

7  Discussion

Perhaps  one  of  the  most  striking  features  of  our  theory  is  the  way
it  returns  to  Fender  &  Julesz's  original  suggestion,  of  a  cortical
memory  that  accounts  for  the  hysteresis  and  which  is  distinct  from  the
matching  process.  Consequently  fusion  does  not  need  to  be  cooperative,
and  our  theory  and  its  implementation  (Grimson  &  Marr  1978)  demonstrate
that  the  computational  problem  of  stereoscopic  matching  can  be  solved
without  cooperativity.  These  arguments  do  not  however  forbid  its
presence.  Critical  for  this  question  are  predictions  (5)  - (7)  about 
the  exact  extent  of  Panum's  fusional  area  for  each  channel.  If  the 
empirical  data  indicate  a fusable  disparity  range  significantly  larger 
than  ±w,  false  targets  will  pose  a problem  not  easily  overcome  using 
straightforward  matching  techniques  like  algorithm  (2)  of  section  5.3. 
In  these  circumstances,  the  matching  problem  could  be  solved  by  an 
algorithm  like  Marr  &  Poggio's  (1976)  operating  within  each  channel,  to
eliminate  possible  false  targets  arising  as  a result  of  an  extended 
disparity  sensitivity  range. 
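
The false-target problem mentioned here can be illustrated with a toy count of same-sign candidates as the search range grows. The pattern below (zero-crossings every 10 units with alternating sign) and both search ranges are invented for the example; they are not measurements or values from the theory.

```python
from typing import List, Tuple

ZeroCrossing = Tuple[int, int]  # (position, sign)


def same_sign_candidates(x_left: int, sign: int,
                         right: List[ZeroCrossing],
                         search_range: int) -> List[int]:
    """Positions of right-image zero-crossings of the given sign that lie
    within +/- search_range of the left-image position."""
    return [x for x, s in right if s == sign and abs(x - x_left) <= search_range]


# Toy right-image pattern: alternating signs at a spacing of 10 units,
# so same-sign zero-crossings are 20 units apart.
right = [(x, +1 if (x // 10) % 2 == 0 else -1) for x in range(-40, 50, 10)]

narrow = same_sign_candidates(0, +1, right, search_range=8)
wide = same_sign_candidates(0, +1, right, search_range=25)
```

With the narrow range only the correct match survives, whereas the wide range admits two additional same-sign candidates. It is in this second situation that some further disambiguation, for instance a cooperative algorithm operating within the channel, would be needed.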

As  it  stands,  there  are  a number  of  points  on  which  the  theory  is 
indefinite,  for  example,  the  exact  structure  of  the  2½-D  sketch  and
the  way  the  various  constraints  are  implemented  there,  the  dynamic 
management  of  the  representation  and  its  dependence  on  eye  movements, 
and  the  details  of  the  strategy  by  which  eye  movements  are  controlled. 

On  the  other  hand,  the  theory  is  precise  enough  to  be  implemented
as  a  computer  program  that  deals  with  the  stereo  matching  problem  for
raw  natural  images.  It  assimilates  a  large  amount  of  empirical  data,
and  admits  of  a  number  of  experimental  predictions  concerning  each  of
the  four  main  parts  of  the  theory:  the  preprocessing  through  four
independent  channels,  matching  of  zero-crossings  and  terminations,  the
2½-D  sketch,  and  the  control  of  vergence  movements.

Finally,  we  feel  that  an  important  feature  of  this  theory  is  that
it  grew  from  an  analysis  of  the  computational  problems  that  underlie
stereopsis,  and  is  devoted  to  a  characterization  of  the  processes
capable  of  solving  them  without  specific  reference  to  the  machinery  in
which  they  run.  The  elucidation  of  the  precise  neural  mechanisms  that 
implement  these  processes,  obfuscated  as  they  must  inevitably  be  by  the 
vagaries  of  natural  evolution,  poses  a fascinating  challenge  to 
classical  techniques  in  the  brain  sciences. 

Footnotes

1:  In  addition,  a  near-optimal  linear  algorithm  of  the  type  of  Dev's
equations  (1)  &  (2)  will  suffer  from  stability  problems.

2:  Owing  presumably  to  a  printing  error,  the  left-hand  image  of  their
figure  1  has  been  rotated  90  degrees.

3:  Not  too  much  weight  should  be  attached  to  the  estimate  of  18,
although  we  feel  that  the  sampling  density  cannot  be  significantly
lower.

4:  The  role  played  in  our  theory  by  the  channels  and  the
zero-crossings  may  represent  a  deep  property  of  the  visual  process.  It  has
recently  been  shown  that  the  information  carried  by  a  band-limited
function  is  contained  in  a  specification  of  its  zero-crossings,
provided  that  some  non-trivial  conditions  are  satisfied  (Logan  1977).
We  conjecture  that  there  may  be  a  relationship  between  these  theorems
and  the  use  of  zero-crossings  summarized  in  table  1  (Marr,  Poggio  &
Ullman,  in  preparation).

5:  These  values  may  be  compatible  with  Wilson  &  Bergen  (1978),  because
of  the  extremely  low  sensitivity  of  the  U  channel.